Ho Chi Minh City Temperature Forecasting
Group 2 - Analysis & Implementation

  • This notebook presents a comprehensive, methodologically rigorous approach to forecasting the daily average temperature in Ho Chi Minh City for the next five days
  • The process documented here follows the project brief from data sourcing and deep exploratory analysis to the final, interpretable champion model
  • Every decision is justified with data-driven evidence, prioritizing sound methodology to ensure a robust and reliable outcome
  • Please use the Outline/Table of Contents feature (on the left sidebar if using VS Code/GG Colab, right side if on Kaggle, etc.) to navigate this notebook efficiently
  • All plots are interactive: we can (and should) zoom, pan, and hover the cursor to reveal the value of each data point

0. Setup & Global Configuration

In [1]:
# --- Core Libraries ---
import pandas as pd
import numpy as np
import warnings
import os
import joblib
from typing import List, Dict, Tuple, Any

# --- Machine Learning ---
from sklearn.model_selection import TimeSeriesSplit
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.linear_model import MultiTaskLassoCV
import skl2onnx
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as rt

# --- Visualization & Utilities ---
import plotly.io as pio
from rich.console import Console
from rich.table import Table

# --- Custom Project Modules ---
import utils.custom_plotly_style_te as cpte
import utils.eda_visualizations as eda
import utils.analysis_helpers as ah
import utils.evaluations as eval  # NOTE: shadows the built-in eval()
import utils.diagnostics as diag
import utils.training as train
import utils.drift_detection as dd
import utils.hourly_processing as hp

0.1. Centralized Pipeline Configuration

In [2]:
class PipelineConfig:
    """
    A centralized configuration class for managing paths, parameters, and flags
    throughout the notebook. Reproducibility? Yea. Quick and easy key settings
    modifications? Also yea:)
    """

    # --- Path & File Configuration ---
    DAILY_DATA_PATH: str = "data/weather_hcm_daily.csv"
    HOURLY_DATA_PATH: str = "data/weather_hcm_hourly.csv"
    UTILS_DIR: str = "utils"

    # --- General Model Parameters ---
    GLOBAL_RANDOM_SEED: int = 105
    HORIZONS: List[int] = [1, 2, 3, 4, 5] # Predict from t+1 to t+5
    TARGET_VARIABLE: str = "temp"

    # --- Data Split Parameters ---
    # Using an 85/15 split for a substantial test set; TimeSeriesSplit CV on the training data lets us forgo a separate validation holdout.
    TEST_SIZE: float = 0.15


config = PipelineConfig()
os.makedirs(config.UTILS_DIR, exist_ok=True)
console = Console()
In [3]:
warnings.filterwarnings("ignore")

pio.templates["te"] = cpte.te_style_template
pio.templates.default = "te"
pd.options.plotting.backend = "plotly"
pio.renderers.default = "notebook+svg"
# svg_renderer = pio.renderers["svg"]
# svg_renderer.width = 1400
# svg_renderer.height = 900

Q1. Data Sourcing

This section details the methodology used to collect the historical weather data for Ho Chi Minh City, which forms the foundation of this forecasting project.

  • Data source: Visual Crossing Weather History API, per the project requirements.
  • Location specificity: It was critical to use the exact location string "Hồ Chí Minh city" to retrieve data for the entire metropolitan area. Similar queries like "Ho Chi Minh" or "Sai Gon" pointed to more localized or slightly different coordinates, which could introduce inconsistencies.
  • Time period: We collected daily weather data spanning from January 1, 2015, to October 1, 2025. This provides a substantial ~10-year period, capturing a wide range of seasonal variations and long-term trends.
  • Feature selection: All available weather metrics were selected, with the logical exception of snow and snowdepth, which are irrelevant for HCMC's tropical climate.
  • Collection strategy:
    • Daily data (3,927 records): Due to the manageable size, the daily dataset was collected using a manual download approach, requiring two accounts to bypass daily limits.
    • Hourly data (94,248 records): For the much larger hourly dataset, manual collection was deemed impractical and error-prone. We developed a robust, automated Python script leveraging the Visual Crossing API.
    • Automated script features:
      • Designed for resilience and data integrity
      • Automated API key rotation across a pool of 25+ keys contributed by group members (using some techniques to work around the account-creation limit), plus automatic retries on failure
      • Automatic backup of the central download file before each write, as a safety measure
      • Intelligent batch-based downloading that resumes from the last known timestamp in the central file, both for convenience and to maximize the 1,000 records/day rate limit per free API key
      • Built-in data integrity checks to detect duplicates or malformed records

$\implies$ The automated process ensured the final dataset was both complete and of high quality. The daily dataset was ready to use immediately; the hourly dataset took four days to collect in full.
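The resume-from-last-timestamp batching described above can be sketched as follows. This is an illustrative helper, not the actual collection script: the names (`next_batch_range`, the CSV layout) are assumptions, and the real script additionally handles key rotation, retries, and backups. A 41-day batch (41 × 24 = 984 hourly records) stays just under the 1,000 records/day free-tier limit per key.

```python
from datetime import timedelta
import os
import pandas as pd

def next_batch_range(central_csv: str, start: str = "2015-01-01", batch_days: int = 41) -> tuple[str, str]:
    """Return the (inclusive) date range for the next download batch."""
    if os.path.exists(central_csv):
        existing = pd.read_csv(central_csv, parse_dates=["datetime"])
        # Resume one hour after the last record already stored
        batch_start = existing["datetime"].max() + timedelta(hours=1)
    else:
        # No central file yet: start from the beginning of the study period
        batch_start = pd.Timestamp(start)
    batch_end = batch_start + timedelta(days=batch_days - 1)
    return batch_start.strftime("%Y-%m-%d"), batch_end.strftime("%Y-%m-%d")
```

Each run of the script would request only this window, append the result to the central file, and stop once the end of the study period is reached.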

Q2. Data Understanding & Exploratory Data Analysis (EDA)

We must understand the data deeply before we can engineer more features and start modeling. This EDA phase serves two primary purposes:

  1. Assess data quality: To identify and understand issues like missing values, outliers, and redundancies that must be addressed during preprocessing.
  2. Uncover patterns & justify decisions: To discover the underlying patterns, seasonality, and relationships within the data. Every significant feature engineering and modeling choice made later in this notebook will be explicitly justified by a finding from this section.

Q2.1. Initial Data Load & Quality Audit

In [4]:
# Load the dataset from the configured path
df_raw = pd.read_csv(config.DAILY_DATA_PATH)

# Convert 'datetime' column to datetime objects and set as the index
df_raw["datetime"] = pd.to_datetime(df_raw["datetime"])
df_raw.set_index("datetime", inplace=True)

# --- Perform a high-level data audit ---
console.print("--- [bold]DataFrame Info[/bold] ---")
df_raw.info(verbose=False, memory_usage="deep")

console.print("\n--- [bold]Descriptive Statistics (Numerical Features)[/bold] ---")
display(df_raw.describe().T)

console.print("\n--- [bold]Missing Value Percentages[/bold] ---")
missing_values = df_raw.isnull().sum()
missing_percentage = (missing_values / len(df_raw) * 100).round(2)
missing_df = pd.DataFrame({
    "Missing Count": missing_values,
    "Missing Percentage": missing_percentage,
}).sort_values(by="Missing Percentage", ascending=False)

console.print(missing_df[missing_df["Missing Count"] > 0])
--- DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3927 entries, 2015-01-01 to 2025-10-01
Columns: 37 entries, name to source
dtypes: float64(25), int64(2), object(10)
memory usage: 3.7 MB
--- Descriptive Statistics (Numerical Features) ---
count mean std min 25% 50% 75% max
latitude 3927.0 10.776000 0.000000e+00 10.776 10.776 10.776 10.776 10.776
longitude 3927.0 106.701000 1.421266e-14 106.701 106.701 106.701 106.701 106.701
tempmax 3927.0 33.071683 1.791789e+00 24.600 32.000 33.000 34.000 39.000
tempmin 3927.0 25.136949 1.586575e+00 18.000 24.000 25.000 26.000 30.000
temp 3927.0 28.445837 1.386448e+00 22.800 27.500 28.400 29.300 33.000
feelslikemax 3927.0 38.463305 3.247412e+00 24.600 36.400 38.800 40.600 48.700
feelslikemin 3927.0 25.830736 2.943036e+00 18.000 24.000 25.000 26.000 40.000
feelslike 3927.0 31.671760 2.958532e+00 22.800 29.700 31.400 33.500 42.000
dew 3927.0 23.503896 2.242134e+00 12.700 22.400 24.200 25.000 27.900
humidity 3927.0 76.570512 9.681565e+00 49.500 69.700 77.500 84.200 99.400
precip 3927.0 5.208251 1.237959e+01 0.000 0.100 0.800 5.000 227.200
precipprob 3927.0 75.986758 4.272181e+01 0.000 100.000 100.000 100.000 100.000
precipcover 3927.0 10.342480 1.321839e+01 0.000 4.170 4.170 12.500 100.000
windgust 3927.0 31.897428 1.051264e+01 8.300 24.800 29.900 36.700 216.000
windspeed 3927.0 19.462210 5.504470e+00 7.600 15.800 18.400 22.300 50.000
windspeedmax 3927.0 19.462210 5.504470e+00 7.600 15.800 18.400 22.300 50.000
windspeedmean 3927.0 10.029870 3.400972e+00 3.200 7.400 9.400 12.300 23.000
windspeedmin 3927.0 3.253603 2.866895e+00 0.000 1.400 2.300 4.900 15.700
winddir 3927.0 185.397479 8.071268e+01 0.000 127.750 180.900 251.600 359.700
sealevelpressure 3927.0 1009.114770 2.094073e+00 1001.500 1007.700 1009.000 1010.500 1016.200
cloudcover 3927.0 55.594321 1.359861e+01 9.700 46.450 55.000 64.800 100.000
visibility 3927.0 9.614719 7.783575e-01 5.400 9.200 9.800 10.200 20.000
solarradiation 3927.0 213.147670 4.969988e+01 0.000 185.800 221.200 250.000 315.800
solarenergy 3927.0 18.403489 4.296753e+00 0.000 16.100 19.100 21.600 27.200
uvindex 3927.0 7.803667 1.494908e+00 0.000 7.000 8.000 9.000 10.000
severerisk 1196.0 27.604515 1.737145e+01 5.000 10.000 30.000 30.000 75.000
moonphase 3927.0 0.483990 2.887589e-01 0.000 0.250 0.500 0.750 0.980
--- Missing Value Percentages ---
            Missing Count  Missing Percentage
severerisk           2731               69.54
preciptype            924               23.53

Q2.2. Feature Definitions & Domain Context

Feature Group | Feature Name | Data Type | Value Range (numerical) / Example Values (categorical) | Definition & how it may help predict the target variable temp
Identifier/Metadata name, address... object ['Hồ Chí Minh city'] Constant metadata identifying the location. Action: Will be dropped.
datetime object 2015-01-01 to 2025-10-01 The unique timestamp for each daily record. Action: Will be converted to the DataFrame index.
Target & Temp. temp float64 22.8 - 33.0 (°C) Primary Target Variable. The average daily temperature.
tempmax/tempmin float64 24.6-39.0 / 18.0-30.0 (°C) The daily maximum and minimum temperatures. Key drivers of the daily average.
feelslike... float64 e.g., 22.8 - 42.0 (°C) The perceived temperature, accounting for humidity and wind. Highly correlated with temp.
Atmospheric humidity float64 49.5 - 99.4 (%) Relative humidity. Crucial in a tropical climate as high humidity traps heat, especially impacting overnight lows.
dew float64 12.7 - 27.9 (°C) Dew Point. The temperature at which air becomes saturated. A direct measure of atmospheric moisture.
sealevelpressure float64 1001.5 - 1016.2 (hPa) Air pressure adjusted to sea level. Changes can indicate shifting weather patterns (e.g., low pressure is associated with storms).
cloudcover float64 9.7 - 100.0 (%) The percentage of the sky covered by clouds. Directly impacts solar radiation reaching the surface.
visibility float64 5.4 - 20.0 (km) The distance at which objects can be clearly seen. Often reduced by rain or high humidity.
Precipitation precip float64 0.0 - 227.2 (mm) Total daily precipitation. Defines the wet vs. dry seasons, which are central to HCMC's climate.
precipprob int64 0 or 100 (%) Probability of precipitation. In this dataset it takes only two values and can be treated as a binary 0/1 indicator.
preciptype object ['rain', nan] The type of precipitation. NaN values deterministically correspond to precip = 0. Action: Impute NaN with 'none'.
Wind windspeed float64 7.6 - 50.0 (km/h) The maximum sustained wind speed for the day.
windgust float64 8.3 - 216.0 (km/h) The highest instantaneous wind speed recorded. Can indicate storm activity.
winddir float64 0.0 - 359.7 (°) The direction from which the wind originates. Seasonal wind patterns (monsoons) are key climate drivers.
Solar & Celestial solarradiation float64 0.0 - 315.8 (W/m²) The amount of solar energy reaching the surface. The primary driver of daytime temperature.
solarenergy float64 0.0 - 27.2 (MJ/m²) The total solar energy over the day. Highly correlated with solarradiation.
uvindex int64 0 - 10 The strength of ultraviolet radiation. A proxy for sun intensity and lack of cloud cover.
sunrise/sunset object 2015-01-01T06:11:22 Timestamps for sunrise and sunset. Action: Can be engineered into a daylight_hours feature.
moonphase float64 0.0 (new) - 0.98 (waning) The fraction of the moon that is illuminated. May have a subtle, long-term cyclical influence.
Text & Categorical conditions object ['Partially cloudy', 'Rain...'] A high-level summary of the day's weather.
description object ['Partly cloudy throughout...'] A more detailed, human-readable description of the weather. Action: Will be processed using NLP techniques.
icon object ['partly-cloudy-day', 'rain'] A machine-readable categorical summary of conditions.
Risk severerisk float64 5.0 - 75.0 A numerical rating for the risk of severe weather events. NaN indicates no recorded risk. Action: Impute NaN with 0.
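The daylight_hours engineering flagged for sunrise/sunset above can be sketched as follows. The column names match the raw dataset; the helper itself is illustrative, not the project's utils code.

```python
import pandas as pd

def add_daylight_hours(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a numerical daylight duration (in hours) from the raw timestamps."""
    out = df.copy()
    sunrise = pd.to_datetime(out["sunrise"])
    sunset = pd.to_datetime(out["sunset"])
    # Elapsed time between sunrise and sunset, converted from seconds to hours
    out["daylight_hours"] = (sunset - sunrise).dt.total_seconds() / 3600.0
    return out
```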

Q2.3. Deep Dive: Key Predictive Feature Groups

Q2.3. Deep Dive: Key Predictive Feature Groups

The Solar Radiation & Cloud Cover Nexus

  • Physical Principle: Solar radiation is the primary engine of Earth's climate, representing the incoming energy that heats the surface. Cloud cover acts as the primary regulator of this energy, reflecting a significant portion back into space before it can be absorbed.
  • Measurement:
    • solarradiation is measured in Watts per square meter (W/m²), quantifying the instantaneous power of solar energy reaching a given area.
    • cloudcover is a percentage, often estimated from satellite imagery or ground-based observations.
  • Predictive Hypothesis:
    • We hypothesize a strong positive correlation between solarradiation and tempmax. Days with high, direct solar radiation will almost certainly be the hottest.
    • Conversely, we expect a strong negative correlation between cloudcover and temp. Higher cloud cover will suppress daytime heating, leading to cooler average temperatures. This relationship is fundamental to HCMC's climate, where the difference between a clear day in the dry season and an overcast day in the rainy season is pronounced.

The Atmospheric Moisture Duo: Humidity & Dew Point

  • Physical Principle: These two features quantify the amount of water vapor in the atmosphere, which plays a crucial role in temperature regulation via the greenhouse effect. Water vapor traps outgoing longwave radiation, preventing heat from escaping, especially at night.
  • Measurement:
    • humidity is relative—it's the percentage of water vapor in the air compared to the maximum amount it could hold at that temperature.
    • dew (Dew Point) is an absolute measure. It is the temperature at which the air would become 100% saturated. A higher dew point means there is more actual moisture in the air, regardless of the current temperature.
  • Predictive Hypothesis:
    • In HCMC's consistently warm climate, the Dew Point will be a more robust predictor than Relative Humidity.
    • We expect a strong positive correlation between dew and tempmin (overnight low). High atmospheric moisture (high dew point) acts like a blanket, preventing nighttime temperatures from dropping significantly. This is a key reason why humid nights in HCMC feel so warm.

The Precipitation System: A Cooling & Lagged Effect

  • Physical Principle: Precipitation has multiple effects on temperature. The act of evaporation from wet surfaces actively cools the air (evaporative cooling). Additionally, the cloud systems that produce rain block solar radiation.
  • Measurement: precip is the total volume of rainfall, measured in millimeters (mm).
  • Predictive Hypothesis:
    • The immediate effect of precip on the same day's temp will be negative. Rainy days are cooler days.
    • More importantly, we hypothesize a lagged effect. A day with heavy rainfall will increase ground moisture, leading to higher humidity and dew points on subsequent days. Therefore, precip from a few days prior may positively influence the tempmin of the current day. This justifies the creation of lagged and rolling-sum precipitation features.

Wind Dynamics: Direction & Speed

  • Physical Principle: Wind influences temperature through advection (transporting air masses) and evaporative cooling.
  • Measurement:
    • windspeed (km/h) measures the rate of air movement.
    • winddir (degrees) indicates the origin direction of the wind.
  • Predictive Hypothesis:
    • windspeed will have a modest cooling effect, especially on the feelslike temperature.
    • winddir is potentially a powerful but complex feature. In HCMC, wind direction is strongly tied to the two monsoon seasons. The Southwest Monsoon (May-Nov) brings moist, rain-bearing winds from the ocean, while the Northeast Monsoon (Dec-Apr) brings drier, cooler air from the continent. We hypothesize that winddir can act as a strong seasonal indicator, but its circular nature (359° is close to 0°) means it must be transformed into cyclical (sine/cosine) features to be used effectively by the model.
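The sine/cosine transform proposed for winddir can be sketched as below, so that 359° and 1° map to nearby points on the unit circle. The helper is illustrative; the actual feature engineering lives in the project's utils modules.

```python
import numpy as np
import pandas as pd

def encode_wind_direction(df: pd.DataFrame, col: str = "winddir") -> pd.DataFrame:
    """Add cyclical sine/cosine encodings of a degree-valued direction column."""
    out = df.copy()
    radians = np.deg2rad(out[col])
    out[f"{col}_sin"] = np.sin(radians)
    out[f"{col}_cos"] = np.cos(radians)
    return out
```

In the encoded (sin, cos) space, directions a few degrees either side of north are close together, which a tree-based or linear model can exploit directly.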

Q2.4. Target Variable Analysis: Trends & Seasonality

In [5]:
fig_target_timeseries = eda.plot_target_variable_time_series(df=df_raw, target_col=config.TARGET_VARIABLE)
fig_target_timeseries.show()
  • 1. Strong & Predictable Annual Seasonality:

    • The 30-day rolling mean (blue line) clearly reveals a consistent, repeating annual cycle of peaks and troughs. This strong seasonal pattern is the most dominant signal in the data.
    • This observation confirms our hypothesis that time-based features are essential. The model must be given information about the time of year to capture this cycle.
  • 2. Evidence of a Long-Term Upward Trend:

    • The 365-day rolling mean (red line) filters out the seasonal fluctuations, exposing a subtle but persistent upward drift in the baseline temperature over the 10-year period.
    • This trend, likely reflecting broader climate change patterns, indicates that the data is non-stationary. A simple model that only learns seasonality would degrade over time. Our model must also account for this long-term evolution.
  • 3. Fluctuating Volatility (Noise):

    • The lower panel, showing the rolling standard deviation, quantifies the data's volatility. The temperature is not equally predictable year-round.
    • We can observe periods of higher volatility (wider swings in temperature), particularly during the transitional periods between the dry and wet seasons. This suggests that the model's accuracy may naturally vary depending on the time of year.

$\implies$ Based on these three distinct components (seasonality, trend, noise), we establish the following non-negotiable feature engineering requirements:

  • To capture seasonality: We will engineer cyclical features (e.g., sine/cosine transformations of day_of_year and month). These allow the model to understand the cyclical nature of time, where December is "close" to January.
  • To capture trend: We will include a year feature or other trend-aware components to allow the model to account for the gradual increase in baseline temperature over time.
  • To capture recent history & noise: We will create lagged features (e.g., temperature from the previous day) and rolling window features (e.g., the average temperature over the last 7 days). These provide the model with a "memory" of the recent system state, helping it to predict short-term fluctuations around the seasonal and trend lines.
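The three requirements above can be sketched in code. This is an illustrative helper, not the project's utils module; the specific lags and the 7-day window are assumptions, and the rolling mean is shifted by one day so that no future information leaks into the features.

```python
import numpy as np
import pandas as pd

def add_temporal_features(df: pd.DataFrame, target: str = "temp") -> pd.DataFrame:
    """Add seasonality, trend, and recent-history features for a daily series."""
    out = df.copy()
    doy = out.index.dayofyear
    # Seasonality: cyclical encoding so Dec 31 sits next to Jan 1
    out["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
    out["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)
    # Trend: calendar year as a coarse trend proxy
    out["year"] = out.index.year
    # Recent history: lags and a leakage-safe rolling mean
    for lag in (1, 2, 3):
        out[f"{target}_lag{lag}"] = out[target].shift(lag)
    out[f"{target}_roll7"] = out[target].shift(1).rolling(7).mean()
    return out
```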

Next, we will perform a deeper dive into the seasonal component to better understand the specific climate dynamics of Ho Chi Minh City throughout the year.

Q2.5. Seasonal Pattern Deep Dive: The Monsoon Cycle

In [6]:
fig_monthly_seasonality = eda.plot_monthly_seasonality(df=df_raw, target_col=config.TARGET_VARIABLE)
fig_monthly_seasonality.show()
  • Primary Peak (Pre-monsoon Heat): The hottest period of the year occurs in April and May. This is not mid-summer but rather the end of the dry season, a period characterized by intense, building heat and humidity just before the onset of the rainy Southwest Monsoon.

  • Mid-year Dip (Monsoon Cooling): A noticeable dip in the median temperature and a compression of the temperature range occurs from June through August. This corresponds to the peak of the rainy season. The increased cloud cover and frequent rainfall have a significant cooling effect, suppressing daytime maximum temperatures.

  • Coolest Period (Dry Season): The coolest months are December and January. This coincides with the Northeast Monsoon, which brings comparatively drier and cooler air masses to the region.

  • Outlier Confirmation: The plot also provides clear visual confirmation of low-temperature outliers, particularly in the cooler months of December and January. These represent unusual cold snaps that a robust model must be able to handle without its feature scaling being skewed.

$\implies$ Feature Engineering:

  1. This complex, non-sinusoidal pattern suggests creating cyclical features (sin_day_of_year, cos_day_of_year). A simple numerical month feature would be insufficient, as it cannot learn that May is climatically closer to June than it is to December.
  2. The presence of outliers provides further, conclusive evidence for our decision to use RobustScaler during preprocessing.

To mathematically isolate these components, we will now perform an STL decomposition (Seasonal-Trend decomposition using LOESS).

Q2.6. STL Decomposition: Isolating Key Signals

In [7]:
# fig_stl = eda.plot_stl_decomposition(df=df_raw, target_col=config.TARGET_VARIABLE)
fig_stl = pio.read_json("./assets/plots/fig_stl_decomposition.json")
fig_stl.show()

The STL decomposition provides a mathematical separation of the time series, allowing us to analyze each component in isolation. We use an additive model (Observed = Trend + Seasonal + Residual) because the magnitude of the seasonal swings does not appear to change proportionally with the long-term trend; the annual temperature variation is relatively constant.

  • Trend Component:

    • This panel confirms the non-stationary nature of the data with a clear, non-linear trend.
    • The baseline temperature shows a distinct rise and fall over multi-year periods, with a notable acceleration in the upward trend from mid-2023 onwards.
    • Modeling Implication: This complex trend confirms that a simple linear year feature will be insufficient. Our model must be flexible enough to capture this dynamic baseline, reinforcing the value of tree-based models and rich temporal features.
  • Seasonal Component:

    • This successfully isolates the strong, repeating annual pattern, which is the dominant driver of temperature variation.
    • As observed, the pattern is highly consistent through the middle of the dataset (approx. 2016-2024). The slight variations in amplitude at the very beginning and end of the series could be attributed to specific climate phenomena in those years or edge effects of the decomposition algorithm.
    • Modeling Implication: This component's stability and strength underscore the critical importance of the cyclical features we plan to engineer.
  • Residual Component:

    • This represents the "noise" or randomness left after the predictable trend and seasonal components have been removed.
    • Crucially, the residuals are centered around zero and display no obvious leftover patterns. This indicates that the decomposition has effectively captured the primary signals.
    • Modeling Implication: The residuals represent the day-to-day weather fluctuations our model will aim to predict using features like recent lags, rolling windows, and atmospheric conditions (cloudcover, precip, etc.).

$\implies$ Conclusion: The decomposition validates our initial analysis. The temperature is a composite of a dynamic long-term trend, a powerful seasonal cycle, and short-term noise. Our feature engineering strategy must explicitly provide the model with tools to address each of these components individually.

Q2.7. Predictor Analysis: Distribution, Skew, & Outliers

Having thoroughly analyzed the target variable, we now turn to the key predictors. Understanding their statistical distributions is essential for making sound preprocessing decisions. Features with significant skew or extreme outliers can disproportionately influence certain models and require robust handling.

In [8]:
# Select a representative subset of key numerical features - a mix of atmospheric, precipitation, and solar variables
features_to_plot = [
    "humidity",
    "dew",
    "sealevelpressure",
    "cloudcover",
    "solarradiation",
    "uvindex",
    "precip",
    "windspeed",
    "windgust",
]

feature_color_map = {
    feature: pio.templates["te"].layout.colorway[i % len(pio.templates["te"].layout.colorway)]
    for i, feature in enumerate(features_to_plot)
}

fig_distributions = eda.plot_feature_distributions(df=df_raw, features=features_to_plot, color_map=feature_color_map)
fig_distributions.show()

We can categorize the features into three distinct groups:

  • 1. Approximately Symmetric Distributions:

    • Humidity, Sealevelpressure, and Cloudcover exhibit largely symmetric, well-behaved distributions. For these features, the mean (dashed line) is very close to the median (solid line), indicating a lack of significant skew. While some have outliers on both ends, they don't appear to heavily distort the central tendency.
  • 2. Moderately Skewed Distributions:

    • Dew, Solarradiation, and Windspeed show a noticeable skew. Dew and Solarradiation are moderately left-skewed (a longer tail towards lower values), while Windspeed is moderately right-skewed. This asymmetry is confirmed by the divergence of their mean and median values.
  • 3. Extremely Skewed, Outlier-Dominated Distributions:

    • Precip and Windgust represent the most challenging features. The distributions are heavily compressed near zero (for Precip), with long, thin tails representing rare but extreme events (heavy rainfall, strong gusts). The vast majority of data points are concentrated in a very small range, with a few extreme outliers dramatically extending the feature's scale. The uvindex also shows a strong left skew, compounded by its discrete, integer-based nature.

$\implies$ Conclusion on Scaling Strategy: Using a standard mean-and-variance scaler (StandardScaler) would be methodologically unsound. The extreme outliers in Precip and Windgust would massively distort the mean and standard deviation, effectively "squashing" the informational variance of the other 99% of data points.

Therefore, the use of RobustScaler is better suited. By scaling data based on the median and interquartile range (IQR), RobustScaler is inherently resistant to the influence of extreme outliers, ensuring that all features are scaled on a consistent and meaningful basis.
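A small synthetic demonstration of this point: one extreme precip-like outlier dominates the mean/std scaling and squashes the typical values, while the median/IQR scaling keeps their spread usable. The values below are made up for illustration, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Six "typical" rainfall values plus one extreme outlier (cf. precip max of 227.2 mm)
rain = np.array([0.0, 0.1, 0.5, 0.8, 2.0, 5.0, 227.2]).reshape(-1, 1)

std_scaled = StandardScaler().fit_transform(rain)  # scaled by mean and std
rob_scaled = RobustScaler().fit_transform(rain)    # scaled by median and IQR
```

Comparing the spread of the six non-outlier points under each scaler makes the distortion concrete: StandardScaler compresses them into a tiny band, RobustScaler does not.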

Q2.8. Bivariate & Seasonal Relationship Analysis

In [9]:
pairplot_features = [
    "temp",
    "dew",
    "precip",
    "humidity",
    "windspeed",
    "solarradiation",
    "cloudcover",
]

fig_pairplot = eda.plot_seasonal_pairplot(df=df_raw, features=pairplot_features)
fig_pairplot.show()
  • 1. Diagonals Confirm Strong Seasonal Shifts:

    • The seasonal density plots on the diagonal clearly show that the distributions of key variables change dramatically between seasons.
    • During the Rainy Season (blue), Dew, Humidity, and Cloudcover are all significantly higher and more concentrated than in the Dry Season (yellow). Precipitation shifts from being almost entirely zero to having a long tail of rain events.
    • This confirms these variables are powerful indicators of the prevailing monsoon season and will be critical for the model.
  • 2. Solar Radiation & Cloud Cover: The Primary Drivers:

    • The scatter plots for Temp vs. Solarradiation and Temp vs. Cloudcover show the strongest, most consistent relationships.
    • There is a clear positive linear trend between Solarradiation and Temp: more sun results in higher temperatures.
    • Conversely, there is a strong negative linear trend between Cloudcover and Temp: more cloud cover leads to cooler temperatures.
  • 3. The Less Pronounced Role of Humidity and Dew Point:

    • The Temp vs. Dew plot shows a positive relationship; higher absolute moisture is associated with higher temperatures, likely due to the "blanket effect" trapping heat.
    • The Temp vs. Humidity plot reveals a more complex, negative relationship. This seems counter-intuitive but is explained by the seasonal context: the highest humidity occurs during the rainy season, which is also when increased cloud cover and rain suppress daytime temperatures. This highlights that Humidity's effect is often indirect (i.e. non-linear).
  • 4. Visual Confirmation of Multicollinearity:

    • The relationship between Dew and Humidity is very strong and linear, as is the inverse relationship between Solarradiation and Cloudcover.
    • This finding is critical for modeling. While a tree-based model can handle this, a linear model would struggle with such highly correlated predictors. We should create distinct feature sets later in the pipeline.
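The multicollinearity seen visually can be quantified with pairwise Pearson correlations. The sketch below uses synthetic stand-ins for dew and humidity (the raw CSV is not bundled here), and the 0.8 cut-off is an assumption, not a project rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(105)
dew = rng.normal(23.5, 2.2, 500)                       # dew-point-like values
humidity = 20 + 2.4 * dew + rng.normal(0, 2.0, 500)    # strongly tied to dew

df_demo = pd.DataFrame({"dew": dew, "humidity": humidity})

# Pairwise Pearson correlation; pairs with |r| above ~0.8 are candidates
# to drop (or combine) when building feature sets for linear models
corr = df_demo.corr().loc["dew", "humidity"]
```

On the real data, the same `df.corr()` call over the full feature set would flag the dew/humidity and solarradiation/cloudcover pairs discussed above.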

$\implies$ EDA Conclusion: We now understand the data much better and can make informed decisions when processing it, engineering features, and modeling later on.

Q3. Data Processing & Correlation Analysis

Q3.1. Column Sanitization

Our first processing step is to remove all columns that were identified as non-predictive during the EDA. This includes:

  • Constant Identifiers: Columns like name, address, latitude, etc., which have the same value for every record.
  • Redundant Duplicates: Columns that are exact copies of others, such as windspeedmax being identical to windspeed.
  • Purely Descriptive Text: Columns like description that will be handled by more targeted feature engineering later, not kept in their raw form.
  • Time-related Objects: sunrise and sunset will be engineered into more useful numerical features (like day length) and are not needed in their raw object format.

This step reduces memory usage and focuses our dataset on potentially valuable features.

In [10]:
df_processed = df_raw.copy()

cols_to_drop = [
    # Constant Identifiers
    "name",
    "address",
    "resolvedAddress",
    "latitude",
    "longitude",
    "source",
    # Duplicate
    "windspeedmax",
    # Descriptive/Object columns to be engineered later
    "description",
    "conditions",
    "icon",
    "sunrise",
    "sunset",
    # Redundant energy column
    "solarenergy",
]

shape_before = df_processed.shape

df_processed.drop(columns=cols_to_drop, inplace=True, errors="ignore")

shape_after = df_processed.shape

console.print(f"DataFrame shape before sanitization: [yellow]{shape_before}[/yellow]")
console.print(f"DataFrame shape after sanitization:  [green]{shape_after}[/green]")
console.print(f"Columns removed: [red]{shape_before[1] - shape_after[1]}[/red]")

console.print("\n[bold]Remaining columns in the processed DataFrame:[/bold]")
console.print(df_processed.columns.tolist())
DataFrame shape before sanitization: (3927, 37)
DataFrame shape after sanitization:  (3927, 24)
Columns removed: 13
Remaining columns in the processed DataFrame:
[
    'tempmax',
    'tempmin',
    'temp',
    'feelslikemax',
    'feelslikemin',
    'feelslike',
    'dew',
    'humidity',
    'precip',
    'precipprob',
    'precipcover',
    'preciptype',
    'windgust',
    'windspeed',
    'windspeedmean',
    'windspeedmin',
    'winddir',
    'sealevelpressure',
    'cloudcover',
    'visibility',
    'solarradiation',
    'uvindex',
    'severerisk',
    'moonphase'
]

Q3.2. Missing Value Imputation


The EDA revealed that missing values are not random but systematic. We address them with deterministic, rule-based imputation that preserves the underlying meaning of the data. Dropping the affected rows is not an option: it would punch gaps into the time series and discard otherwise valid records.

  • preciptype: Missingness in this column occurs exclusively when the precip (precipitation amount) is zero. Therefore, we will fill these NaN values with the string 'none'. This transforms the missing data into an explicit, meaningful category for "no-precipitation days."
  • severerisk: Values in this column are only recorded when a certain risk threshold is met. The absence of a value implies the absence of risk. We will fill these NaN values with 0.

This approach avoids data deletion, ensures that every value in our dataset is explicitly defined, and potentially provides more predictive signal to the models.

In [11]:
# 1. Fill 'preciptype' based on the deterministic relationship with 'precip'
# (assignment instead of fillna(inplace=True) avoids pandas' chained-assignment FutureWarning)
df_processed["preciptype"] = df_processed["preciptype"].fillna("none")

# 2. Fill 'severerisk' with 0, as missingness implies no recorded severe risk
df_processed["severerisk"] = df_processed["severerisk"].fillna(0)


remaining_nulls = df_processed.isnull().sum().sum()

null_count_color = "green" if remaining_nulls == 0 else "red"
console.print(
    f"Total remaining null values in the DataFrame: [bold {null_count_color}]{remaining_nulls}[/bold {null_count_color}]"
)

if remaining_nulls == 0:
    console.print("\n[bold]Value counts for 'preciptype' after imputation:[/bold]")
    print(df_processed["preciptype"].value_counts())

    console.print("\n[bold]Value counts for 'severerisk' after imputation:[/bold]")
    print(df_processed["severerisk"].value_counts())

else:
    console.print("There are still missing values that need to be addressed.")
Total remaining null values in the DataFrame: 0
Value counts for 'preciptype' after imputation:
preciptype
rain    3003
none     924
Name: count, dtype: int64
Value counts for 'severerisk' after imputation:
severerisk
0.0     2731
30.0     546
10.0     448
60.0     195
75.0       6
5.0        1
Name: count, dtype: int64

Q3.3. Correlations Check


We will employ two different correlation methods to ensure a comprehensive understanding:

  • Pearson Correlation (r):

    • Standard method, measuring the strength of the linear relationship between two variables
    • Sensitive to outliers
    • A value of +1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 indicates no linear relationship.
  • Spearman Correlation (ρ or rho):

    • Measures the strength of the monotonic relationship between two variables
    • Operates on the ranks of the data rather than the raw values $\rightarrow$ highly robust to outliers, capable of capturing relationships that are consistently increasing or decreasing but not necessarily linear.

Our weather data contains outliers and potential non-linearities $\implies$ the Spearman correlation provides a more reliable and robust assessment of the underlying relationships.
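A minimal, self-contained sketch (not part of the pipeline, using toy data rather than our weather frame) of why this matters: for a monotonic but non-linear series with one extreme outlier, Spearman still reports a perfect monotonic relationship, while Pearson is dragged far from 1 by the outlier's leverage.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 101, dtype=float))
y = x**3          # monotonic but strongly non-linear
y.iloc[-1] *= 10  # one extreme outlier at the top of the range

# Spearman works on ranks, which the outlier does not change.
pearson = x.corr(y, method="pearson")
spearman = x.corr(y, method="spearman")
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
# Spearman stays at exactly 1.0; Pearson falls well below it.
```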

In [12]:
df_numerical = df_processed.select_dtypes(include=np.number)

fig_pearson = eda.plot_correlation_heatmap_enhanced(df=df_numerical, method="pearson")
fig_pearson.show()

ah.display_correlation_summary(df=df_numerical, target=config.TARGET_VARIABLE, method="pearson")
Target Correlation Summary (Pearson) 

  Most Positively Correlated                            Most Negatively Correlated with                            
         with 'temp'                                                 'temp'                                        
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓                         ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓                          
┃ Feature      ┃ Coefficient ┃                         ┃ Feature          ┃ Coefficient ┃                          
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩                         ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩                          
│ feelslike    │       0.893 │                         │ humidity         │      -0.340 │                          
│ tempmax      │       0.847 │                         │ cloudcover       │      -0.308 │                          
│ tempmin      │       0.792 │                         │ precip           │      -0.222 │                          
│ feelslikemin │       0.771 │                         │ sealevelpressure │      -0.128 │                          
│ feelslikemax │       0.634 │                         │ winddir          │      -0.087 │                          
└──────────────┴─────────────┘                         └──────────────────┴─────────────┘                          
Predictor Multicollinearity Summary (Pearson) 

       Highest Positive Predictor Pairs                           Highest Negative Predictor Pairs                 
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓            ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓         
┃ Feature 1      ┃ Feature 2    ┃ Coefficient ┃            ┃ Feature 1  ┃ Feature 2        ┃ Coefficient ┃         
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩            ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩         
│ tempmin        │ feelslikemin │       0.911 │            │ dew        │ sealevelpressure │      -0.652 │         
│ solarradiation │ uvindex      │       0.911 │            │ cloudcover │ solarradiation   │      -0.601 │         
│ temp           │ feelslike    │       0.893 │            │ humidity   │ solarradiation   │      -0.567 │         
│ tempmax        │ temp         │       0.847 │            │ cloudcover │ uvindex          │      -0.557 │         
│ feelslikemax   │ feelslike    │       0.833 │            │ humidity   │ sealevelpressure │      -0.554 │         
└────────────────┴──────────────┴─────────────┘            └────────────┴──────────────────┴─────────────┘         
In [13]:
fig_spearman = eda.plot_correlation_heatmap_enhanced(df=df_numerical, method="spearman")
fig_spearman.show()

ah.display_correlation_summary(df=df_numerical, target=config.TARGET_VARIABLE, method="spearman")
Target Correlation Summary (Spearman) 

Most Positively Correlated with                           Most Negatively Correlated with                          
             'temp'                                                    'temp'                                      
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓                         ┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓                        
┃ Feature        ┃ Coefficient ┃                         ┃ Feature          ┃ Coefficient ┃                        
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩                         ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩                        
│ feelslike      │       0.874 │                         │ humidity         │      -0.393 │                        
│ tempmax        │       0.840 │                         │ cloudcover       │      -0.335 │                        
│ tempmin        │       0.757 │                         │ precip           │      -0.160 │                        
│ feelslikemin   │       0.756 │                         │ winddir          │      -0.101 │                        
│ solarradiation │       0.599 │                         │ sealevelpressure │      -0.091 │                        
└────────────────┴─────────────┘                         └──────────────────┴─────────────┘                        
Predictor Multicollinearity Summary (Spearman) 

       Highest Positive Predictor Pairs                           Highest Negative Predictor Pairs                 
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓            ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓         
┃ Feature 1      ┃ Feature 2    ┃ Coefficient ┃            ┃ Feature 1  ┃ Feature 2        ┃ Coefficient ┃         
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩            ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩         
│ tempmin        │ feelslikemin │       0.999 │            │ dew        │ sealevelpressure │      -0.610 │         
│ solarradiation │ uvindex      │       0.881 │            │ humidity   │ solarradiation   │      -0.600 │         
│ temp           │ feelslike    │       0.874 │            │ cloudcover │ solarradiation   │      -0.584 │         
│ tempmax        │ temp         │       0.840 │            │ humidity   │ visibility       │      -0.566 │         
│ feelslikemax   │ feelslike    │       0.823 │            │ humidity   │ uvindex          │      -0.559 │         
└────────────────┴──────────────┴─────────────┘            └────────────┴──────────────────┴─────────────┘         
  • 1. Key drivers:

    • Both matrices confirm that temp is most strongly correlated with its variants (feelslike, tempmax, tempmin), which is expected.
    • More importantly, they both confirm our hypotheses about the primary physical drivers: solarradiation has a strong positive correlation with temperature, while humidity and cloudcover have the strongest negative correlations.
  • 2. Severe multicollinearity: Both methods reveal three critical clusters of extreme multicollinearity, which is a concern for linear models:

    • Temperature cluster: Features like temp, feelslike, tempmin, and feelslikemin are all highly inter-correlated (many coefficients > 0.80).
    • Humidity cluster: dew and humidity show a very strong positive relationship (r = 0.82, ρ = 0.80).
    • Solar cluster: solarradiation and uvindex are also highly correlated, effectively measuring the same underlying phenomenon of sun intensity.
  • 3. Spearman's robustness:

    • The correlation between tempmin and feelslikemin is ρ = 0.999. This indicates they are functionally identical in terms of their rank order and are perfectly redundant predictors.
    • The negative correlations for humidity and cloudcover with temp are stronger in the Spearman results $\rightarrow$ increases confidence in their importance, as their relationship is consistently monotonic, not just an artifact of a few outliers.

$\implies$ Attempting to fit a standard linear model with all of these features would result in unstable and uninterpretable coefficients. We need another approach to manage this multicollinearity.

Q3.4. Strategy for Multicollinearity


We have confirmed the presence of severe multicollinearity - this occurs when two or more predictor variables are highly correlated, making it difficult for a model to disentangle their individual effects on the target variable. For standard linear models, this can lead to unstable, unreliable, and uninterpretable coefficient estimates.

  • Standard diagnostic tool: Variance Inflation Factor (VIF)

    • This common diagnostic calculates how much the variance of a regression coefficient is inflated by its correlation with the other predictors.
    • A high VIF (usually > 5 or 10) for a feature indicates that it is highly collinear and a candidate for removal.
  • Our choice: Regularization

    • While a manual, VIF-based feature pruning is a valid approach, we will instead use a more integrated strategy by leveraging models with built-in regularization.
    • Algorithms like RidgeCV, ElasticNetCV, and MultiTaskLassoCV (which we will use for feature selection) are specifically designed to handle multicollinearity. They add a penalty term to the loss function that shrinks the coefficients of correlated predictors.
    • In practice, this means the model itself will learn to "de-emphasize" redundant features. For example, when faced with the highly correlated tempmin and feelslikemin, a regularized model will likely assign a significant coefficient to one while shrinking the coefficient of the other towards zero. This achieves the same goal as manual VIF-based removal but in a data-driven, automated fashion that is integrated directly into the model training process.

$\implies$ Justification: By choosing regularized models and embedded feature selection, we have a more robust and sophisticated approach to managing multicollinearity. This allows the model to make the optimal data-driven trade-offs during training.
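Although we opt for regularization, the VIF diagnostic described above is easy to illustrate. The sketch below (synthetic data, not our weather frame; the column names `temp`, `feelslike`, and `windspeed` are just illustrative) computes VIF from first principles: each feature is regressed on all the others, and $\text{VIF}_i = 1 / (1 - R_i^2)$.

```python
import numpy as np
import pandas as pd

def compute_vif(df: pd.DataFrame) -> pd.Series:
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    column i on all the other columns (plus an intercept)."""
    scores = {}
    for col in df.columns:
        y = df[col].to_numpy()
        X = df.drop(columns=col).to_numpy()
        X = np.column_stack([np.ones(len(df)), X])    # intercept term
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS fit
        r2 = 1.0 - ((y - X @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores)

# Synthetic demo: 'feelslike' is nearly a copy of 'temp'; 'windspeed' is independent.
rng = np.random.default_rng(0)
temp = rng.normal(28, 2, 500)
demo = pd.DataFrame({
    "temp": temp,
    "feelslike": temp + rng.normal(0, 0.3, 500),
    "windspeed": rng.normal(10, 3, 500),
})
vifs = compute_vif(demo)
print(vifs.round(1))  # 'temp'/'feelslike' far above the usual threshold of 5; 'windspeed' near 1
```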

Q3.5. Final Processed DataFrame before Feature Engineering


In [14]:
console.print(
    f"[bold]Final Shape of Processed DataFrame:[/bold] [green]{df_processed.shape}[/green]",
    "\n[bold]Data Types and Memory Usage:[/bold]",
)
df_processed.info(memory_usage="deep")
Final Shape of Processed DataFrame: (3927, 24) 
Data Types and Memory Usage:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3927 entries, 2015-01-01 to 2025-10-01
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   tempmax           3927 non-null   float64
 1   tempmin           3927 non-null   float64
 2   temp              3927 non-null   float64
 3   feelslikemax      3927 non-null   float64
 4   feelslikemin      3927 non-null   float64
 5   feelslike         3927 non-null   float64
 6   dew               3927 non-null   float64
 7   humidity          3927 non-null   float64
 8   precip            3927 non-null   float64
 9   precipprob        3927 non-null   int64  
 10  precipcover       3927 non-null   float64
 11  preciptype        3927 non-null   object 
 12  windgust          3927 non-null   float64
 13  windspeed         3927 non-null   float64
 14  windspeedmean     3927 non-null   float64
 15  windspeedmin      3927 non-null   float64
 16  winddir           3927 non-null   float64
 17  sealevelpressure  3927 non-null   float64
 18  cloudcover        3927 non-null   float64
 19  visibility        3927 non-null   float64
 20  solarradiation    3927 non-null   float64
 21  uvindex           3927 non-null   int64  
 22  severerisk        3927 non-null   float64
 23  moonphase         3927 non-null   float64
dtypes: float64(21), int64(2), object(1)
memory usage: 939.6 KB

Q4. Feature Engineering


Q4.1. Foundational Temporal Features


One of the most critical stages of the project - we need to transform the clean, processed data into an information-rich feature set that makes the underlying patterns we discovered in the EDA explicit for our machine learning models. We will engineer features to encode four key dimensions of the time-series data:

  1. Basic timestamps: Discrete markers of time (year, month).
  2. Cyclical seasonality: Continuous representations of the annual cycle.
  3. Complex seasonality (Fourier): Features to capture multi-frequency patterns like the bimodal climate cycle.
  4. Domain-specific indicators: Meteorological features that make physical relationships explicit.
In [15]:
df_engineered = df_processed.copy()

# --- 4.1.2: Basic Temporal Features ---
# These features help the model understand the long-term trend and basic monthly groupings.
df_engineered["year"] = df_engineered.index.year
df_engineered["month"] = df_engineered.index.month
df_engineered["week_of_year"] = df_engineered.index.isocalendar().week.astype(int)
df_engineered["day_of_year"] = df_engineered.index.dayofyear

# --- 4.1.3: Cyclical Features ---
# Sine/cosine transformations are essential for cyclical data. They encode the
# "closeness" of time points (e.g., Dec 31 is close to Jan 1) in a way that
# linear models can understand.
day_of_year_max = 366.0  # Account for leap years
month_max = 12.0

df_engineered["sin_day_of_year"] = np.sin(2 * np.pi * df_engineered["day_of_year"] / day_of_year_max)
df_engineered["cos_day_of_year"] = np.cos(2 * np.pi * df_engineered["day_of_year"] / day_of_year_max)
df_engineered["sin_month"] = np.sin(2 * np.pi * df_engineered["month"] / month_max)
df_engineered["cos_month"] = np.cos(2 * np.pi * df_engineered["month"] / month_max)

# --- 4.1.4: Fourier Series Features ---
# HCMC has a bimodal seasonal pattern. While the annual sin/cos
# captures the main cycle, higher-frequency Fourier terms can help model this
# more complex, multi-frequency seasonality (e.g., a semi-annual cycle). This
# may or may not help, but it is cheap to test.
for k in range(2, 4):  # Add k=2 (semi-annual) and k=3 (quarterly) terms
    df_engineered[f"fourier_sin_{k}"] = np.sin(2 * np.pi * k * df_engineered["day_of_year"] / day_of_year_max)
    df_engineered[f"fourier_cos_{k}"] = np.cos(2 * np.pi * k * df_engineered["day_of_year"] / day_of_year_max)


temporal_features_added = [
    "year",
    "month",
    "week_of_year",
    "day_of_year",
    "sin_day_of_year",
    "cos_day_of_year",
    "sin_month",
    "cos_month",
    "fourier_sin_2",
    "fourier_cos_2",
    "fourier_sin_3",
    "fourier_cos_3",
]
console.print(
    "[bold]Temporal features created.[/bold]\n",
    f" - Number of new features: [green]{len(temporal_features_added)}[/green]\n",
    f" - DataFrame shape is now: [green]{df_engineered.shape}[/green]",
)

console.print("\n[bold]Sample of new temporal features:[/bold]")
display(df_engineered[temporal_features_added].head())
Temporal features created.
  - Number of new features: 12
  - DataFrame shape is now: (3927, 36)
Sample of new temporal features:
year month week_of_year day_of_year sin_day_of_year cos_day_of_year sin_month cos_month fourier_sin_2 fourier_cos_2 fourier_sin_3 fourier_cos_3
datetime
2015-01-01 2015 1 1 1 0.017166 0.999853 0.5 0.866025 0.034328 0.999411 0.051479 0.998674
2015-01-02 2015 1 1 2 0.034328 0.999411 0.5 0.866025 0.068615 0.997643 0.102821 0.994700
2015-01-03 2015 1 1 3 0.051479 0.998674 0.5 0.866025 0.102821 0.994700 0.153891 0.988088
2015-01-04 2015 1 1 4 0.068615 0.997643 0.5 0.866025 0.136906 0.990584 0.204552 0.978856
2015-01-05 2015 1 2 5 0.085731 0.996318 0.5 0.866025 0.170830 0.985301 0.254671 0.967028

Q4.2. Domain-Specific Feature Creation


Beyond generic time-based signals, we can engineer features that make the physical relationships we hypothesized in our EDA explicit. By calculating these values, we provide the model with pre-computed, high-value predictors that directly relate to the drivers of temperature in a tropical climate.

In [16]:
# --- 4.2.1: Daylight Hours ---
# This is a more direct measure of solar energy potential than day_of_year.
sunrise_sunset = pd.read_csv(
    config.DAILY_DATA_PATH, usecols=["datetime", "sunrise", "sunset"], parse_dates=["datetime", "sunrise", "sunset"]
).set_index("datetime")

df_engineered["daylight_hours"] = (sunrise_sunset["sunset"] - sunrise_sunset["sunrise"]).dt.total_seconds() / 3600

# --- 4.2.2: Daily Temperature Range ---
# This feature quantifies daily temperature volatility. A low range often indicates
# a cloudy or rainy day, while a high range suggests a clear, sunny day.
df_engineered["temp_range"] = df_engineered["tempmax"] - df_engineered["tempmin"]

# --- 4.2.3: Dew Point Deficit ---
# This measures how close the air is to saturation. A smaller deficit means
# higher relative humidity and can inhibit nighttime cooling.
df_engineered["dew_point_deficit"] = df_engineered["temp"] - df_engineered["dew"]


domain_features_added = ["daylight_hours", "temp_range", "dew_point_deficit"]
console.print(
    "[bold]Domain-specific features created.[/bold]\n",
    f" - Number of new features: [green]{len(domain_features_added)}[/green]\n",
    f" - DataFrame shape is now: [green]{df_engineered.shape}[/green]",
)

console.print("\n[bold]Sample of new domain-specific features:[/bold]")
display(df_engineered[domain_features_added].head())
Domain-specific features created.
  - Number of new features: 3
  - DataFrame shape is now: (3927, 39)
Sample of new domain-specific features:
daylight_hours temp_range dew_point_deficit
datetime
2015-01-01 11.504722 8.0 8.7
2015-01-02 11.507222 10.0 9.2
2015-01-03 11.509722 9.0 7.8
2015-01-04 11.512500 8.0 5.8
2015-01-05 11.515556 5.7 4.6

Q4.3. Lag & Rolling Window Features


These features are the cornerstone of time-series forecasting, allowing the model to learn from the recent past.

  • Lag features provide a direct "memory" of a variable's value from previous days. For example, yesterday's temperature (temp_lag_1) is one of the single best predictors of today's temperature; we will quantify exactly how strong it is later with a naive persistence baseline. Our lookback window spans 1 to 28 days. We deliberately omit temp_lag_365 (the same day last year): the gap is too large and the signal noisy, and annual seasonality is already better captured by the recent temperature lags together with the sin/cos and Fourier features. Moreover, a year is not exactly 365 days (leap years, and even those are an approximation), so a fixed 365-day lag slowly drifts out of alignment.

  • Rolling window features provide "context" by summarizing a variable's behavior over a recent period. This captures momentum and volatility. For example, the 7-day average temperature (temp_roll_7d_mean) tells the model if the past week has been generally hotter or cooler than the seasonal average.

Preventing data leakage:

  • To calculate rolling window features for a given day t, we only use data from t-1 and earlier
  • We prevent including the value from day t in its own rolling calculation by applying .shift(1) to the data before the .rolling() operation.
In [17]:
# --- 4.3.1: Lag Features ---
# Define the features to lag and the time periods for the lookback.
# We choose key dynamic variables that characterize the daily weather system.
features_to_lag = [
    "temp",
    "humidity",
    "solarradiation",
    "precip",
    "windspeed",
    "cloudcover",
    "dew",
    "sealevelpressure",
    "windgust",
]
lag_periods = [1, 2, 3, 7, 14, 28]

for feature in features_to_lag:
    for lag in lag_periods:
        df_engineered[f"{feature}_lag_{lag}"] = df_engineered[feature].shift(lag)

# --- 4.3.2: Rolling Window Features ---
# Define features and windows for rolling statistics.
features_for_rolling = [
    "temp",
    "humidity",
    "windspeed",
    "solarradiation",
    "dew",
    "cloudcover",  # Precip has its own logic
]
window_sizes = [7, 14, 28]
rolling_stats = ["mean", "std"]

for feature in features_for_rolling:
    # Shift the series by 1 to prevent data leakage.
    series_shifted = df_engineered[feature].shift(1)
    for window in window_sizes:
        for stat in rolling_stats:
            col_name = f"{feature}_roll_{window}d_{stat}"
            df_engineered[col_name] = series_shifted.rolling(window=window, min_periods=1).agg(stat)

# Specific logic for precipitation: sum is more meaningful than mean.
precip_shifted = df_engineered["precip"].shift(1)
for window in window_sizes:
    df_engineered[f"precip_roll_{window}d_sum"] = precip_shifted.rolling(window=window, min_periods=1).sum()


shape_before_fe = df_processed.shape[1]
shape_after_fe = df_engineered.shape[1]
num_new_features = shape_after_fe - shape_before_fe - len(temporal_features_added) - len(domain_features_added)

console.print(
    "[bold]Lag and rolling features created.[/bold]\n",
    f" - Number of new lag/rolling features: [green]{num_new_features}[/green]\n",
    f" - DataFrame shape is now: [green]{df_engineered.shape}[/green]",
)
Lag and rolling features created.
  - Number of new lag/rolling features: 93
  - DataFrame shape is now: (3927, 132)

Q4.4. Categorical Feature Encoding


The dataset contains one structured categorical feature, preciptype, which currently holds the values 'rain' and 'none'. We must convert it into a numerical format so the model can understand it.

  • Chosen method: One-hot Encoding
    • We will use One-hot Encoding to convert the single preciptype column into two new binary (0/1) columns: preciptype_rain and preciptype_none.
  • Justification:
    • We did not choose Dummy Encoding (k-1 columns), because keeping all k columns preserves the explicit interpretability of each category.
    • While creating k columns introduces perfect multicollinearity (any one column can be perfectly predicted from the others), this is not a concern for tree-based models, and for linear models the built-in regularization will effectively handle the redundancy. Prioritizing clarity and interpretability is therefore the better choice.
In [18]:
# Perform One-Hot Encoding on the 'preciptype' column
df_engineered = pd.get_dummies(
    df_engineered,
    columns=["preciptype"],
    prefix="preciptype",
    dtype=int,
)


encoded_cols = [col for col in df_engineered.columns if "preciptype_" in col]
console.print(
    "[bold]Original 'preciptype' column removed.[/bold]\n",
    f" - New binary columns added: `{encoded_cols}`\n",
    f" - DataFrame shape is now: [green]{df_engineered.shape}[/green]",
)

console.print("\n[bold]Sample of new encoded features:[/bold]")
display(df_engineered[encoded_cols].head())
Original 'preciptype' column removed.
  - New binary columns added: `['preciptype_none', 'preciptype_rain']`
  - DataFrame shape is now: (3927, 133)
Sample of new encoded features:
preciptype_none preciptype_rain
datetime
2015-01-01 1 0
2015-01-02 1 0
2015-01-03 0 1
2015-01-04 1 0
2015-01-05 0 1

Q4.5. Unstructured Text Feature Engineering


  • The dataset contains a description column with human-readable weather summaries (e.g., "Partly cloudy throughout the day with rain in the morning."). While our numerical features capture quantitative data, this text may contain valuable qualitative nuances.
  • To leverage this information, we will employ a standard and robust Natural Language Processing (NLP) pipeline to convert the text into a set of numerical features.
  • Chosen method: TF-IDF + Truncated SVD
    1. TF-IDF (Term Frequency-Inverse Document Frequency): converts the text into a high-dimensional matrix, where each value represents the importance of a word to a specific day's description.
    2. Truncated SVD (Singular Value Decomposition): The TF-IDF matrix is very sparse and has too many dimensions (one for each unique word) $\rightarrow$ Truncated SVD will reduce this matrix into a small number of dense components (5 in our case). In other words, we can understand it as capturing the core "topics" or "concepts" within the weather descriptions.
In [19]:
text_data = df_raw[["description"]].copy()
df_engineered = df_engineered.join(text_data)

# --- Define the NLP processing pipeline ---
n_svd_components = 7  # The `description` field has only 26 distinct values, mostly repetitive, so 7 components should suffice
nlp_pipeline = Pipeline([
    (
        "tfidf",
        TfidfVectorizer(
            stop_words="english",
            ngram_range=(1, 2),  # Capture both single words and two-word phrases
            max_features=1000,  # Limit vocabulary to the top 1000 terms
        ),
    ),
    ("svd", TruncatedSVD(n_components=n_svd_components, random_state=config.GLOBAL_RANDOM_SEED)),
])

text_features = nlp_pipeline.fit_transform(df_engineered["description"])

text_feature_names = [f"text_svd_{i}" for i in range(n_svd_components)]
df_text_features = pd.DataFrame(text_features, index=df_engineered.index, columns=text_feature_names)

df_engineered = pd.concat([df_engineered, df_text_features], axis=1)

df_engineered.drop(columns=["description"], inplace=True)


console.print(
    f" - Number of new text-based features (SVD components): [green]{n_svd_components}[/green]\n",
    f" - DataFrame shape is now: [green]{df_engineered.shape}[/green]",
    sep="",
)

console.print("\n[bold]Sample of new text features:[/bold]")
display(df_engineered[text_feature_names].head())
 - Number of new text-based features (SVD components): 7
 - DataFrame shape is now: (3927, 140)
Sample of new text features:
text_svd_0 text_svd_1 text_svd_2 text_svd_3 text_svd_4 text_svd_5 text_svd_6
datetime
2015-01-01 0.879053 -0.176581 -0.320034 -0.237530 -0.112032 -0.016034 0.155117
2015-01-02 0.879053 -0.176581 -0.320034 -0.237530 -0.112032 -0.016034 0.155117
2015-01-03 0.555798 -0.049113 0.585367 0.280771 -0.240929 0.061356 0.448770
2015-01-04 0.879053 -0.176581 -0.320034 -0.237530 -0.112032 -0.016034 0.155117
2015-01-05 0.407113 -0.081170 0.087575 0.089282 -0.493769 -0.186068 -0.152486

Q4.6. Final Verification & NaN Handling


Now we verify the structure and acknowledge the presence of NaN values. These NaNs are an expected and correct outcome of the lag and rolling window operations. The initial rows of the DataFrame lack sufficient historical data for these calculations (e.g., since the first day has no "day before" to lag from).

The number of rows containing these NaNs is determined by our longest lookback period. These rows will be dropped in the next section, after we create our target variables. This ensures a perfect, leak-free alignment between our features (X) and the values we aim to predict (y).

In [20]:
max_lag = max(lag_periods)
max_rolling_window = max(window_sizes)
max_lookback = max(max_lag, max_rolling_window)

console.print(
    f"Longest lag period: [cyan]{max_lag}[/cyan] days\n",
    f"Longest rolling window: [cyan]{max_rolling_window}[/cyan] days\n",
    f"==> Maximum lookback period: [bold yellow]{max_lookback}[/bold yellow] days\n",
    f"\nThis means the first {max_lookback} rows will contain NaN values and will be dropped during the modeling prep stage.\n",
    f"[bold]Final Shape of DataFrame:[/bold] [green]{df_engineered.shape}[/green]",
    sep="",
)

console.print("[bold]First 5 Rows (showing initial NaNs):[/bold]")
display(df_engineered.head())
Longest lag period: 28 days
Longest rolling window: 28 days
==> Maximum lookback period: 28 days

This means the first 28 rows will contain NaN values and will be dropped during the modeling prep stage.
Final Shape of DataFrame: (3927, 140)
First 5 Rows (showing initial NaNs):
tempmax tempmin temp feelslikemax feelslikemin feelslike dew humidity precip precipprob ... precip_roll_28d_sum preciptype_none preciptype_rain text_svd_0 text_svd_1 text_svd_2 text_svd_3 text_svd_4 text_svd_5 text_svd_6
datetime
2015-01-01 31.0 23.0 26.6 31.4 23.0 26.8 17.9 60.3 0.0 0 ... NaN 1 0 0.879053 -0.176581 -0.320034 -0.237530 -0.112032 -0.016034 0.155117
2015-01-02 30.0 20.0 25.0 30.4 20.0 25.1 15.8 57.2 0.0 0 ... 0.0 1 0 0.879053 -0.176581 -0.320034 -0.237530 -0.112032 -0.016034 0.155117
2015-01-03 32.0 23.0 26.8 33.5 23.0 27.4 19.0 63.7 0.2 100 ... 0.0 0 1 0.555798 -0.049113 0.585367 0.280771 -0.240929 0.061356 0.448770
2015-01-04 32.0 24.0 27.1 34.8 24.0 28.3 21.3 71.7 0.0 0 ... 0.2 1 0 0.879053 -0.176581 -0.320034 -0.237530 -0.112032 -0.016034 0.155117
2015-01-05 30.6 24.9 26.7 33.2 24.9 27.7 22.1 76.4 6.0 100 ... 0.2 0 1 0.407113 -0.081170 0.087575 0.089282 -0.493769 -0.186068 -0.152486

5 rows × 140 columns

Q5. Modeling Strategy & Evaluation


Q5.1. Forecasting Strategies, Target Definition & Data Splitting


Q5.1.1. Defining the Forecasting Strategy


Forecasting Architectures

1. Recursive strategy:

  • Method: A single model is trained to predict only one step ahead (t+1). To forecast t+2, the model's own prediction for t+1 is fed back into its input features. This process is repeated for each subsequent step.
  • Drawback: This approach is highly susceptible to error accumulation. Any small error in the t+1 forecast becomes part of the input for the t+2 forecast, and these errors compound, often leading to a rapid degradation in performance for longer horizons.

2. Multi-output strategy:

  • Method: A single, more complex model is trained to predict all five forecast horizons (t+1 through t+5) simultaneously in one forward pass.
  • Drawback: This model is a "generalist." While computationally efficient, it often struggles to optimize for any single horizon, resulting in performance that is good on average but rarely the best for any specific day.

3. Direct strategy (We choose this):

  • Method: We train a separate, independent, specialist model for each forecast horizon. The model for t+3, for example, is trained exclusively on data pairs of (Features at time t, Target at time t+3).
  • Justification: This is the most robust and potentially highest-performing strategy for this problem. It avoids error accumulation and allows each model to become an expert at its specific task, learning the unique feature relationships that matter most at that particular forecast distance. While it requires training multiple models, the gain in accuracy and reliability is a worthwhile trade-off, especially since the daily dataset is relatively small and we only need to train five models (one per horizon).

What to Predict?

1. Predicting the direct value (We choose this): The most straightforward and robust approach is to predict the final temperature value directly. The previous feature engineering step (lags, rolling windows, cyclical features) has already provided the model with most of the necessary information about trends, seasonality, and recent changes, allowing it to learn the complete mapping from features to the final target value end-to-end.

2. Predicting the delta: We could predict the change in temperature from day t to t+h; the final forecast is then Temp_t + Predicted_Delta. This can be effective when the series is strongly trending, but it can be unstable. For temperature in a relatively "stable" climate like Ho Chi Minh City's, the deltas carry no strong trend (there is some, as noted in the EDA above, but not a clear enough signal), making it hard for a model to reliably predict the day-to-day change with great accuracy.

3. Predicting the residual: After de-trending and de-seasonalizing (as in our STL decomposition), we could model the remaining "noise" or residual. This is a more common technique in classical time-series analysis but is often harder to implement effectively in an ML setting, and, as with the delta approach above, the residuals lack strong, clear patterns for the model to reliably learn and predict.

$\implies$ We will implement a direct forecasting strategy where we train five specialist models, each tasked with predicting the absolute temperature value for its assigned horizon (t+1 through t+5).
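The direct strategy can be sketched in a few lines. This is a minimal, self-contained illustration on a synthetic series — the toy data, the two lag features, and the plain `Ridge` model are stand-ins, not the notebook's actual pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Toy series standing in for the daily temperature (illustrative values only).
rng = np.random.default_rng(0)
df = pd.DataFrame({"temp": 27 + 2 * np.sin(np.arange(200) / 30) + rng.normal(0, 0.5, 200)})
df["temp_lag_1"] = df["temp"].shift(1)
df["temp_lag_2"] = df["temp"].shift(2)

# One specialist model per horizon: the model for t+h only ever sees
# pairs of (features at time t, temperature at time t+h).
models = {}
for h in range(1, 6):
    pairs = pd.concat(
        [df[["temp_lag_1", "temp_lag_2"]], df["temp"].shift(-h).rename("y")], axis=1
    ).dropna()
    models[h] = Ridge().fit(pairs[["temp_lag_1", "temp_lag_2"]], pairs["y"])

# Forecast all five horizons from the latest available feature row.
latest = df[["temp_lag_1", "temp_lag_2"]].iloc[[-1]]
forecasts = {h: float(m.predict(latest)[0]) for h, m in models.items()}
```

Note that no prediction is ever fed back as an input, which is exactly how the direct strategy avoids the recursive strategy's error accumulation.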

Q5.1.2. Target Variable Creation & Final Alignment


Since we're using the direct forecasting strategy, we need to create five distinct target variables by shifting the temp column backwards in time: the targets for day t's features are the actual temperatures from days t+1 through t+5.

This will introduce NaN values at the end of the DataFrame (as there is no future data for the last few days) and at the beginning (from the lag/rolling features). We will then need to drop all rows containing any NaN values, ensuring every row in the final DataFrame is complete and correct.

In [21]:
df_model_ready = df_engineered.copy()

# --- 1. Create the Multi-Horizon Target Variables ---
for h in config.HORIZONS:
    df_model_ready[f"target_temp_t+{h}"] = df_model_ready["temp"].shift(-h)

# --- 2. Final Alignment via NaN Removal ---
shape_before_align = df_model_ready.shape
df_model_ready.dropna(inplace=True)
shape_after_align = df_model_ready.shape

rows_dropped = shape_before_align[0] - shape_after_align[0]

console.print(
    f"Shape before final alignment: [green]{shape_before_align}[/green]\n",
    f"Shape after final alignment:  [green]{shape_after_align}[/green]\n",
    f"Rows dropped due to lookback/lookahead: [red]{rows_dropped}[/red]",
    sep="",
)
Shape before final alignment: (3927, 145)
Shape after final alignment:  (3894, 145)
Rows dropped due to lookback/lookahead: 33

Q5.1.3. The Final Train-Test Split


  • Training Set (_train): This data (the first 85%) is used for all model training, cross-validation, and hyperparameter tuning.
  • Test Set (_test): This data (the final 15%) is held out and completely untouched until the very end of the project. It simulates future, unseen data; the champion model will be evaluated on it exactly once to produce its final, unbiased performance score.

We adopt this Train/Test with Cross-Validation approach over a fixed Train/Validation/Test split:

  • A fixed validation set can be small and idiosyncratic, leading a model to overfit to its specific patterns
  • By using TimeSeriesSplit cross-validation on the larger training set, we evaluate our models across multiple, diverse time periods, leading to a much more robust and reliable estimate of their true generalization performance.
  • After evaluating the models and choosing the most suitable one, we will retrain it once on the entire training set, giving it the most recent data (closest to the test set). This matters all the more because weather/climate is non-stationary: data drift and/or concept drift mean a model may perform differently (usually worse) as the gap between its training data and the test data grows. More on that in Q7.
In [22]:
# --- 1. Define Feature and Target Columns ---
target_cols = [f"target_temp_t+{h}" for h in config.HORIZONS]

# We must drop the original 'temp' column, as it is now a source of data leakage (it comes
# from the same day as the target values we are trying to predict)
features_to_drop = target_cols + ["temp"]

# All remaining columns are predictors.
feature_cols = [col for col in df_model_ready.columns if col not in features_to_drop]

X = df_model_ready[feature_cols]
y = df_model_ready[target_cols]

# --- 2. Perform the Chronological Split ---
n_total = len(X)
n_test = int(n_total * config.TEST_SIZE)
n_train = n_total - n_test

X_train_full = X.iloc[:n_train]
y_train = y.iloc[:n_train]

X_test_full = X.iloc[n_train:]
y_test = y.iloc[n_train:]


console.print(
    f" - Total samples:     {n_total}\n",
    f" - Training set size: {n_train} ({1 - config.TEST_SIZE:.0%})\n",
    f" - Test set size:     {n_test} ({config.TEST_SIZE:.0%})",
    sep="",
)

console.rule("[bold]Time Range of the sets:[/bold]", align="center")
console.print(
    f"Training Set Time Range:  {X_train_full.index.min().date()} to {X_train_full.index.max().date()}\n",
    f"Test Set Time Range:      {X_test_full.index.min().date()} to {X_test_full.index.max().date()}",
    sep="",
)
 - Total samples:     3894
 - Training set size: 3310 (85%)
 - Test set size:     584 (15%)
───────────────────────────────────────────── Time Range of the sets: ─────────────────────────────────────────────
Training Set Time Range:  2015-01-29 to 2024-02-20
Test Set Time Range:      2024-02-21 to 2025-09-26

Q5.1.4. Creating Specialized Feature Subsets


As established in our EDA, severe multicollinearity poses a significant challenge for linear models. To address this, we now create two distinct, purpose-built feature sets for our upcoming model bake-off:

  1. X_train_tree / X_test_tree: A comprehensive (full) feature set containing all engineered predictors. This set is designed for tree-based models (like LightGBM, XGBoost) and other non-linear architectures which are inherently robust to multicollinearity.

  2. X_train_linear / X_test_linear: A carefully curated, parsimonious feature set for our linear models (RidgeCV, ElasticNetCV, etc.). This subset will be created with data-driven methodology that penalizes feature redundancy.

We choose: MultiTaskLassoCV

  • What it is: Lasso (Least Absolute Shrinkage and Selection Operator) is a form of linear regression that includes an $L_1$ penalty term. This penalty forces the coefficients of less important or redundant features to shrink to exactly zero, effectively performing automated feature selection.
  • Why MultiTask? Standard Lasso works on a single target. Since we have five forecast horizons, MultiTaskLassoCV jointly fits models for all five targets simultaneously. It applies a group penalty that encourages the selection of a common set of features useful across all horizons, which is ideal for finding a single, robust feature set for our entire direct forecasting system. We could fit a separate LassoCV per horizon, but that adds unnecessary complexity and reduces interpretability without much (if any) meaningful gain.
  • Why CV? The CV (Cross-Validation) component automatically finds the optimal strength of the regularization penalty (alpha) using our TimeSeriesSplit cross-validation strategy, ensuring the feature selection process is itself robust and leak-free.

In essence, by fitting MultiTaskLassoCV on the full training set, we are asking the data itself to identify the most potent and non-redundant subset of predictors, creating a lean, powerful feature set perfectly tailored for our linear models.
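The selection mechanism can be demonstrated on synthetic data. In this sketch (the data, the near-duplicate feature, and the two toy targets are invented for illustration), MultiTaskLassoCV's group penalty zeroes out entire feature columns across all targets at once:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLassoCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400
X = rng.normal(size=(n, 10))
X[:, 5] = X[:, 0] + rng.normal(0, 0.01, n)  # near-duplicate (collinear) of feature 0
# Two targets driven by only the first three features; the rest is noise.
Y = X[:, :3] @ rng.normal(size=(3, 2)) + rng.normal(0, 0.1, (n, 2))

lasso = MultiTaskLassoCV(cv=TimeSeriesSplit(n_splits=5), max_iter=5000)
lasso.fit(StandardScaler().fit_transform(X), Y)

# A feature is "kept" if it has a non-zero coefficient for any target,
# mirroring the mask logic in the notebook cell below.
kept = np.abs(lasso.coef_).sum(axis=0) > 0
print(f"Selected {kept.sum()} of {X.shape[1]} features")
```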

In [23]:
X_train_tree = X_train_full
X_test_tree = X_test_full

# --- Use MultiTaskLassoCV to select a low-collinearity subset for linear models ---
# Fit only on the training data
# lasso_pipeline = Pipeline([
#     ("scaler", RobustScaler()),
#     (
#         "lasso",
#         MultiTaskLassoCV(
#             cv=TimeSeriesSplit(n_splits=5),
#             random_state=config.GLOBAL_RANDOM_SEED,
#             n_jobs=-1,
#         ),
#     ),
# ])

# lasso_pipeline.fit(X_train_full, y_train)

# # Identify features with non-zero coefficients.
# coefficients = lasso_pipeline.named_steps["lasso"].coef_
# selected_features_mask = np.abs(coefficients).sum(axis=0) > 0
selected_features_mask = joblib.load("./assets/saved_results/selected_features_mask.pkl")  # Load precomputed mask for speed
linear_feature_cols = X_train_full.columns[selected_features_mask].tolist()

X_train_linear = X_train_full[linear_feature_cols]
X_test_linear = X_test_full[linear_feature_cols]


console.print(
    f" - Tree-based model feature count: [cyan]{X_train_tree.shape[1]}[/cyan]\n",
    f" - Linear model feature count:     [cyan]{X_train_linear.shape[1]}[/cyan] (selected by LASSO)",
    sep="",
)
 - Tree-based model feature count: 139
 - Linear model feature count:     41 (selected by LASSO)
In [24]:
ah.display_lasso_summary_table(
    all_features=X_train_tree.columns.tolist(),
    selected_features=X_train_linear.columns.tolist(),
)

fig_sunburst_final = ah.plot_lasso_summary_sunburst(
    all_features=X_train_tree.columns.tolist(),
    selected_features=X_train_linear.columns.tolist(),
)
fig_sunburst_final.show()
           LASSO Feature Selection Summary           
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━┓
┃ Feature Category         ┃ Kept ┃ Dropped ┃ Total ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━┩
│ Raw Predictor            │  13  │    9    │  22   │
│ Temporal (Cyclical)      │  4   │    4    │   8   │
│ Text (SVD)               │  3   │    4    │   7   │
│ Lag (Humidity)           │  1   │    5    │   6   │
│ Lag (Dew)                │  2   │    4    │   6   │
│ Lag (Cloudcover)         │  1   │    5    │   6   │
│ Rolling (Dew)            │  0   │    6    │   6   │
│ Rolling (Cloudcover)     │  2   │    4    │   6   │
│ Lag (Precip)             │  6   │    0    │   6   │
│ Lag (Sealevelpressure)   │  2   │    4    │   6   │
│ Lag (Windgust)           │  0   │    6    │   6   │
│ Lag (Windspeed)          │  0   │    6    │   6   │
│ Lag (Temp)               │  3   │    3    │   6   │
│ Lag (Solarradiation)     │  0   │    6    │   6   │
│ Rolling (Windspeed)      │  1   │    5    │   6   │
│ Rolling (Temp)           │  2   │    4    │   6   │
│ Rolling (Humidity)       │  0   │    6    │   6   │
│ Rolling (Solarradiation) │  1   │    5    │   6   │
│ Temporal (Basic)         │  0   │    4    │   4   │
│ Domain-Specific          │  0   │    3    │   3   │
│ Rolling (Precip)         │  0   │    3    │   3   │
│ Categorical              │  0   │    2    │   2   │
└──────────────────────────┴──────┴─────────┴───────┘

LASSO provided insights into which features are most valuable for a linear forecasting model.

  • Aggressive pruning of redundancy: LASSO reduced the feature set by 70%, from 139 down to 41. It heavily pruned the highly collinear Lag and Rolling Window features, which made up the bulk of the original set.

  • Preference for recent & long-term history:

    • For the primary temp variable, LASSO kept the most recent history (temp_lag_1, _2, _3), confirming that yesterday's temperature is a powerful predictor.
    • Interestingly, for other atmospheric variables like dew and cloudcover, it discarded recent lags but kept the long-term lags (14 and 28 days), consistently. This suggests that for a linear model, the broader climate context from 2-4 weeks ago is more valuable than the noisy data from 1-2 days ago.
  • Value of raw predictors: LASSO retained a core set of 13 raw predictors, including fundamental drivers like solarradiation and cloudcover, after deeming many of their derivative lag/rolling features redundant.

  • Rejection of complex/derived Features:

    • The model dropped all domain-specific features (daylight_hours, temp_range, etc.) and all categorical flags (preciptype_none/rain).
    • For a linear model, the information contained in these features was likely already captured more effectively by the combination of raw predictors and selected lags. For example, the information in temp_range is already present in the raw tempmax and tempmin features, which LASSO chose to keep.

Q5.2. Evaluation Framework


Q5.2.1. Evaluation Metrics


Before training any models, we must define how we will measure "good." A single metric is rarely sufficient; a holistic view requires assessing performance from multiple perspectives. The following table defines the standard regression metrics we will consider.

| Metric | Formula | Core Question | Interpretation |
|---|---|---|---|
| R-squared ($R^2$) | $1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$ | "How much of the temperature's variance is explained by our model?" | A scale-free score in $(-\infty, 1]$. A score of 0.7 means our model explains 70% of the variance. Primary metric for model comparison. |
| Adjusted $R^2$ | $1 - \frac{(1-R^2)(n-1)}{n-p-1}$ | "What is the $R^2$ after penalizing for adding more features?" | Useful for explanatory linear modeling to avoid overfitting by adding useless predictors. Not used in our final reports. |
| RMSE | $\sqrt{\frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | "What is the typical magnitude of our error, penalizing larger errors more?" | In degrees Celsius. Because errors are squared, RMSE is more sensitive to large forecast misses than MAE. A key indicator of error size. |
| MAE | $\frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$ | "On average, what is the absolute error of our forecast in degrees?" | Highly interpretable. An MAE of 0.5 means our forecasts are off by an average of 0.5°C. Robust to outliers. |
| MAPE | $\frac{100}{n} \sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert$ | "On average, what is the percentage error of our forecast?" | A relative error metric. A MAPE of 2% means our forecasts are off by an average of 2% of the actual temperature. |

Based on the definitions above, we will adopt the following reporting hierarchy:

  1. Primary metric for selection: R-squared ($R^2$) will be the primary metric used in our bake-off leaderboards to compare the relative performance of different model architectures.
  2. Key error metrics for interpretation: RMSE and MAE will be reported for our finalist and champion models. They provide a direct, interpretable measure of the forecast error in degrees Celsius.
  3. Secondary relative metric: MAPE will be reported for the final champion model to provide a sense of percentage error.
  4. Excluded metric: We will not report Adjusted $R^2$. For our predictive task, the use of a robust cross-validation strategy on held-out data (TimeSeriesSplit) is a more direct and applicable method for penalizing model complexity and assessing true generalization performance.
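For concreteness, all four reported metrics are a one-liner each with scikit-learn. The `y_true`/`y_pred` arrays below are placeholder values, not the notebook's data:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Toy actual vs. forecast temperatures (°C), for illustration only.
y_true = np.array([26.6, 25.0, 26.8, 27.1, 26.7])
y_pred = np.array([26.9, 25.4, 26.5, 27.6, 26.4])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))          # back to °C units
mae = mean_absolute_error(y_true, y_pred)
mape = mean_absolute_percentage_error(y_true, y_pred) * 100  # as a percentage
```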

Q5.2.2. Cross-Validation Strategy


To reliably estimate a model's performance on unseen data, we must use a cross-validation strategy that respects the temporal nature of our data.

  • Invalid method: standard K-Fold CV

    • Standard K-Fold randomly shuffles and splits the data into k folds.
    • For time-series data, this is a catastrophic error. It allows the model to train on data from the future (e.g., 2023) to make predictions on data from the past (e.g., 2021), which is data leakage, leading to wildly optimistic and completely invalid performance estimates.
  • Chosen method: TimeSeriesSplit

    • TimeSeriesSplit creates a series of expanding or sliding window splits. In each fold, the training set always consists of data that occurred before the validation set.

    • How it Works (for 5 splits): Split the train set into 6 blocks

      • Fold 1: Train on [Block 1], validate on [Block 2]
      • Fold 2: Train on [Block 1, Block 2], validate on [Block 3]
      • Fold 3: Train on [Block 1, 2, 3], validate on [Block 4]
      • ...and so on.

$\implies$ This process perfectly simulates a real-world production scenario where a model is periodically retrained on all available historical data to make forecasts for the immediate future. The average score across these folds provides a robust, realistic, and leak-free estimate of our model's generalization ability. All model selection decisions in this notebook will be based on performance using this TimeSeriesSplit strategy.

Q5.3. The Baseline Model: Establishing the Performance Floor


A baseline provides the "performance floor"—the minimum score that any sophisticated model must convincingly beat to be considered useful. Without a baseline, a model's performance score lacks context.

Several options for a baseline exist (for the current daily dataset):

  • Historical average: Predict every day with the overall average temperature from the training set. This is the simplest possible baseline, but it is too naive for our data as it completely ignores the powerful seasonal cycle we identified.

  • Seasonal naive: Predict the temperature for a given day using the temperature from the same day one year prior (e.g., $\hat{y}_t = y_{t-365}$). This is a stronger baseline that accounts for seasonality.

  • Persistence model (We choose this): This model operates on a simple, powerful heuristic: "the best guess for the near future is the recent past." It predicts the temperature for a future day t+h using the temperature from day t.

    • Formula: $\hat{y}_{t+h} = y_t$
    • Justification: We choose the persistence model as our primary baseline because it represents the most fundamental temporal relationship in the data. For short horizons like t+1, it can be surprisingly difficult to beat. Any model we build must first demonstrate that it can leverage our rich feature set to understand the weather dynamics better than this simple persistence rule. It provides the true, lowest bar for a time-aware forecast.
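The persistence rule $\hat{y}_{t+h} = y_t$ amounts to comparing a series against a shifted copy of itself. A minimal sketch on a synthetic series (the actual evaluation lives in `utils.evaluations`; `temps` here is a toy stand-in):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score

# Toy "temperature" series: slow seasonal swing plus daily noise.
rng = np.random.default_rng(1)
temps = pd.Series(27 + 2 * np.sin(np.arange(300) / 20) + rng.normal(0, 0.4, 300))

scores = {}
for h in (1, 5):
    actual = temps.shift(-h).dropna()   # true value at t+h
    forecast = temps.loc[actual.index]  # "prediction": today's value, unchanged
    scores[h] = r2_score(actual, forecast)
    print(f"t+{h} persistence R²: {scores[h]:.3f}")
```

Even on this toy series the pattern from the table below emerges: the score degrades as the horizon grows, because "today" drifts further from the day being predicted.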
In [25]:
tscv_splitter = TimeSeriesSplit(n_splits=5)

baseline_cv_scores = eval.evaluate_persistence_model(
    df_model_ready=df_model_ready,
    X_train=X_train_full,
    y_train_all_horizons=y_train,
    horizons=config.HORIZONS,
    cv_splitter=tscv_splitter,
)

baseline_cv_scores_for_table = {
    horizon: data["scores"]["r2"] for horizon, data in baseline_cv_scores["Baseline (Persistence)"].items()
}
eval.display_baseline_results(baseline_cv_scores_for_table)
   Persistence (Naive) Baseline Performance    
       (Cross-Validated on Training Set)       
┏━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ Horizon ┃ Mean CV R² ┃ Std Dev of CV Scores ┃
┡━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│   t+1   │   0.5455   │        0.0675        │
│   t+2   │   0.2080   │        0.1193        │
│   t+3   │   0.0634   │        0.1463        │
│   t+4   │   0.0172   │        0.1399        │
│   t+5   │  -0.0467   │        0.1391        │
│ Average │   0.1575   │        0.1224        │
└─────────┴────────────┴──────────────────────┘
  • Decent short-term performance: A mean CV R² of 0.5455 for the t+1 horizon is decent for a naive model. It confirms that for Ho Chi Minh's stable weather system, today's temperature is a strong predictor of tomorrow's. Our t+1 models must convincingly beat this high bar.

  • Rapid performance decay: As expected, the model's performance collapses as the forecast horizon increases, becoming worse than a simple historical average by t+5 (negative R² score). This highlights the core challenge of our project: predicting temperature several days in advance, where simple persistence fails completely.

  • The importance of a robust CV score:

    • The average CV score (e.g., 0.5455 for t+1) is a conservative but realistic estimate of the model's performance over time.
    • It is expected to be lower than a one-off score on the final test set (which for t+1 is ~0.685) because the TimeSeriesSplit forces the model to make predictions across various time periods, including older data where patterns might be different. A model that performs well across all CV folds is more robust and reliable than one that only performs well on the most recent data.

$\implies$ Goal: Our models must demonstrate their value by not only outperforming this baseline on average, but especially by providing a significant performance lift on the challenging t+3 to t+5 horizons, where the naive model provides no value.

Q5.4. Model Bake-Off


We will now systematically evaluate a diverse range of model architectures to identify the most promising candidate for our champion model. This process is conducted with the following protocol:

  • Comprehensive saves: We use the run_bakeoff utility function to generate and save detailed result files (.pkl). These files contain not just the final scores, but the scores and predictions from every fold, allowing for deep diagnostic analysis.
  • Presentation reporting: We then load those pre-computed artifacts to generate summary tables and visualizations. The config.TRAIN_NEW_MODELS flag allows us to optionally re-run the training process if needed, but load the pre-computed ones by default to save time.
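The train-or-load logic behind that flag follows a standard pattern, sketched below with placeholder names (`TRAIN_NEW_MODELS`, `RESULTS_PATH`, and the demo routine are illustrative stand-ins; the real logic lives in `run_bakeoff` and `config`):

```python
import os
import tempfile

import joblib

TRAIN_NEW_MODELS = False  # stand-in for config.TRAIN_NEW_MODELS
RESULTS_PATH = os.path.join(tempfile.gettempdir(), "bakeoff_results_demo.pkl")  # placeholder path

def run_bakeoff_demo():
    """Stand-in for the expensive training routine; returns dummy scores."""
    return {"RidgeCV": {"avg_cv_r2": 0.42}}

if TRAIN_NEW_MODELS or not os.path.exists(RESULTS_PATH):
    results = run_bakeoff_demo()
    joblib.dump(results, RESULTS_PATH)  # persist scores and fold predictions
else:
    results = joblib.load(RESULTS_PATH)  # reuse the pre-computed artifact
```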

We will evaluate four main families of models:

  1. Standard ML models: Linear and tree-based models that serve as strong, interpretable baselines.
  2. Advanced time-series models: Architectures like SARIMAX and Prophet that are specifically designed for time-series data.
  3. Deep learning models: Models like LSTM, CNN, Seq2Seq, with attention strategy may capture the pattern better, especially with longer horizons.
  4. Ensemble models: Hybrid models that combine the predictions of multiple base models to improve robustness and accuracy.

Q5.4.1. Stage 1: High-Level Leaderboard


In [26]:
results_files_to_load = [
    "./assets/saved_results/bakeoff_results_linear.pkl",
    "./assets/saved_results/bakeoff_results_tree.pkl",
    "./assets/saved_results/bakeoff_results_advanced_ts.pkl",
    "./assets/saved_results/bakeoff_results_dl.pkl",
]

df_leaderboard = eval.generate_leaderboard_data(results_files=results_files_to_load, baseline_scores=baseline_cv_scores)

eval.display_leaderboard(df_leaderboard, title="Stage 1 - High-level Leaderboard")

finalist_models = [
    "RidgeCV",
    "LinearRegression",
    "ElasticNetCV",
    "CatBoostRegressor",
    "LGBMRegressor",
]
                                         Stage 1 - High-level Leaderboard                                          
                               (Metrics averaged across all 5 horizons & 5 CV folds)                               
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃      ┃                         ┃                 ┃             ┃               ┃              ┃ Avg CV MAPE (%) ┃
┃ Rank ┃ Model                   ┃  Model Family   ┃ Avg CV R² ↑ ┃ Avg CV RMSE ↓ ┃ Avg CV MAE ↓ ┃        ↓        ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│  #1  │ RidgeCV                 │     Linear      │   0.4219    │    0.9570     │    0.7514    │      2.68%      │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #2  │ LinearRegression        │     Linear      │   0.4185    │    0.9595     │    0.7523    │      2.68%      │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #3  │ ElasticNetCV            │     Linear      │   0.4088    │    0.9673     │    0.7596    │      2.71%      │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #4  │ CatBoostRegressor       │   Tree-based    │   0.3248    │    1.0258     │    0.8122    │      2.89%      │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #5  │ LGBMRegressor           │   Tree-based    │   0.2901    │    1.0485     │    0.8319    │      2.97%      │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #6  │ XGBRegressor            │   Tree-based    │   0.2370    │    1.0894     │    0.8629    │      3.07%      │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #7  │ Baseline (Persistence)  │    Baseline     │   0.1575    │    1.1479     │    0.8834    │      3.15%      │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #8  │ Prophet                 │  Advanced time  │   -1.2556   │    1.6671     │    1.3497    │      4.82%      │
│      │                         │     series      │             │               │              │                 │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #9  │ LSTM                    │  Deep learning  │   -1.4927   │    1.9124     │    1.5530    │      5.46%      │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│ #10  │ SARIMAX                 │  Advanced time  │ -3118.7928  │    50.6686    │   44.6821    │     158.28%     │
│      │                         │     series      │             │               │              │                 │
└──────┴─────────────────────────┴─────────────────┴─────────────┴───────────────┴──────────────┴─────────────────┘

Looks rather surprising, doesn't it? A naive reading would suggest linear models are simply superior. However, a deeper methodological analysis tells a fuller story.

1. Linear models' unexpected dominance:

  • All three linear models (RidgeCV, LinearRegression, ElasticNetCV) form a tight, high-performing cluster at the top of the leaderboard, with an average CV R² of ~0.42. Other robust linear variants we tested but do not display here (e.g., Huber, RANSAC, Theil-Sen) showed similarly strong performance.
  • Interpretation: In a way, it proves that by explicitly creating features that encode seasonality, trend, and recent history, we've provided these simple models with the linear signals they needed to perform well. Their built-in regularization also proved effective at managing the feature set and preventing overfitting.

2. Tree-based models - Why so underwhelming?

  • The best tree-based models (CatBoostRegressor, LGBMRegressor) significantly outperform the naive baseline but fall short of the linear models, with an average R² of ~0.32. Other variants like RandomForest and AdaBoost performed even less competitively.
  • Hypothesis: This lower average score may be misleading. We hypothesize that these highly flexible models excel at capturing complex, non-linear signals for the short-term t+1 forecast but struggle to generalize on the weaker signals of longer t+3 to t+5 horizons, potentially overfitting and dragging down their average score. We will dive deeper in stage 2 to verify this.
  • Furthermore, note that tree models cannot extrapolate. They interpolate well within the range of their training targets, but by construction they cannot predict values outside it; a linear model, by fitting a line $y = ax + b$, extends the trend to any input and produces a reasonable value. This matters here because, as seen in the EDA (likely due to climate change), most of the hottest days lie in the test set rather than the training set, so when presented with data points outside their training range, tree models cannot predict those values correctly.

3. Advanced time-series & deep learning models - Not suited for this task:

  • The specialized models (SARIMAX, Prophet, LSTM) failed catastrophically "out of the box". Even a Seq2Seq model with an attention mechanism, generous lookback windows, and an extensive feature set crafted specifically for these deep learning models (which need more features, especially ones that capture long-term "trend/memory"), while surpassing the baseline (no longer a negative $R^2$ score), still could not match the lowest-performing tree-based model.
  • Interpretation: This is most likely not a case of "low result, bad model" but a confirmation that these architectures are unsuited to a generalist bake-off. They require highly specific data preparation (e.g., stationarity for SARIMAX, which our weather data lacks; 3D tensors for LSTM), extensive hyperparameter tuning, and often larger datasets than we have available. More advanced architectures we tested (e.g., CNNs, Seq2Seq) confirmed these poor results, meaning that at this problem scale their complexity is a disadvantage, not an advantage.
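The extrapolation limitation of trees noted above is easy to demonstrate on a toy problem (this is an illustrative sketch, not the notebook's models or data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X_tr = np.arange(0, 10, 0.5).reshape(-1, 1)  # training inputs: 0 to 9.5
y_tr = 2 * X_tr.ravel() + 1                   # a simple linear trend

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
lin = LinearRegression().fit(X_tr, y_tr)

X_new = np.array([[20.0]])                    # far outside the training range
print("tree:  ", tree.predict(X_new)[0])      # stuck at the largest training target (20.0)
print("linear:", lin.predict(X_new)[0])       # follows the trend (41.0)
```

The tree can only output values seen in its leaves, so it clamps at the training maximum; the linear model extends the fitted line. The same mechanism caps a tree's forecasts when the test period contains record-hot days absent from the training set.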

$\implies$ To stage 2: Based on these results, we will advance the following five models to a more granular, horizon-level and fold-level analysis to test our central hypothesis:

  1. RidgeCV
  2. LinearRegression
  3. ElasticNetCV
  4. CatBoostRegressor
  5. LGBMRegressor

Q5.4.2. Stage 2: Selected Models Deep-Dive


The high-level leaderboard gave us a surprising result, but an average score can hide critical details. We must now test our hypothesis: Do tree-based models outperform linear models on the short-term t+1 forecast, even if their overall average is lower?

To do this, we will analyze the performance of the best model from each family: RidgeCV for linear models vs. CatBoostRegressor for tree-based models.

In [27]:
df_horizon_data = eval.generate_finalist_horizon_data(
    results_files=results_files_to_load,
    finalists=finalist_models,
    metrics=["r2", "rmse", "mae"],
)
eval.display_finalist_deep_dive_plots(df_horizon_data)
  • Hypothesis refuted - linear models dominate absolutely: The linear RidgeCV (blue) model outperforms the CatBoostRegressor (yellow) at every single forecast horizon, from t+1 through t+5, on every metric.

  • (More) Consistent performance vs. sharp decay:

    • The RidgeCV model has a more graceful, predictable decay in performance as the forecast horizon increases.
    • In contrast, the CatBoostRegressor's performance (especially its R² score) falls off a cliff after t+2 (the gap between the two models widens as the horizon increases, with a sudden jump between t+1 and t+2), indicating it struggles significantly to generalize on the weaker signals of longer-range forecasts.
  • Why??:

    • It seems the relationship between temp and the features we supply to the models is more linear than initially assumed.
    • Our feature engineering (creating cyclical, lag, and rolling features) also captured most of the non-linear, more complex relationships.
    • The flexibility of the tree-based model may have actually been a disadvantage, leading it to overfit to noise in the training folds, especially for the more difficult long-range forecasts.
    • These tree-based models are still using their default out-of-the-box (OOB) parameters, i.e., the libraries' defaults. We didn't tune them with Optuna, nor even "lightly" tune them (hand-picking values likely to beat the defaults). We will test this later.
In [28]:
df_fold_data = eval.generate_finalist_fold_data(
    results_files=results_files_to_load,
    finalists=["RidgeCV", "CatBoostRegressor"],
    horizons_to_plot=["t+1", "t+5"],
)

fig_fold_plot = eval.display_finalist_fold_plot(df_fold_data)
fig_fold_plot.show()
  • The trend continues: Even when we break down performance by individual CV folds, the RidgeCV model (light blue) consistently outperforms or matches the CatBoostRegressor (dark blue) across both short and long horizons.

  • Fold 1 confirms concept drift: Both models perform poorly on the first fold, particularly for the long t+5 horizon where CatBoost's R² is negative, confirming the oldest data in our set is less predictive of the future.

  • The tree-based model shows higher variance: The CatBoostRegressor's performance line is more erratic than RidgeCV's, especially on the t+5 horizon. It suffers a significant performance drop between Fold 3 and Fold 4, while the linear model remains more stable. This suggests the untuned tree model is more sensitive to the specific data in each fold, a sign of higher variance and potential overfitting.

$\implies$ Single-model bake-off conclusion: for this specific feature set and with default hyperparameters, linear models are the superior single-model architecture. They are not only more accurate on average but also more stable across different time periods and forecast horizons. RidgeCV is our single-model champion.

However, this does not mean tree-based models have no value. Their ability to capture complex, non-linear interactions is a theoretically powerful asset. We will investigate whether we can combine the stability of linear models with the predictive power of tree-based models through ensembling. The errors that linear models make might be different from those of non-linear ones, so by combining them and averaging their predictions, we may get a model that is both more stable and more frequently accurate.

Q5.4.4. Stage 3: Ensemble Model Bake-Off

Q5.4.4. Stage 3: Ensemble Model Bake-Off

While RidgeCV proved to be the most stable and accurate single model, our analysis suggests that different model families might capture different aspects of the underlying data. Tree-based models are adept at finding complex, non-linear interactions, while linear models excel at identifying stable, overarching trends. Ensembles might be able to leverage the strengths of both while mitigating their individual weaknesses.

We will test two ensemble strategies:

  1. Simple averaging ensemble: This model simply averages the predictions from RidgeCV and CatBoostRegressor. It's a straightforward test of whether their errors tend to cancel each other out.
  2. Stacking ensemble (with meta-learner): A more sophisticated approach. This model feeds the predictions of the base models (RidgeCV, CatBoostRegressor) into a second-level "meta-learner" (RidgeCV). The meta-learner's job is to learn the optimal, data-driven weights for combining the base predictions, potentially learning to trust one model more than the other depending on the forecast horizon.
In [29]:
results_files_to_load = [
    "./assets/saved_results/bakeoff_results_linear.pkl",
    "./assets/saved_results/bakeoff_results_tree.pkl",
    "./assets/saved_results/bakeoff_results_ensemble.pkl",
]

df_leaderboard_final = eval.generate_leaderboard_data(
    results_files=results_files_to_load,
    baseline_scores=baseline_cv_scores,
)
eval.display_leaderboard(df_leaderboard_final, title="Stage 3 - The Rise of Ensembles?")
                                         Stage 3 - The Rise of Ensembles?                                          
                               (Metrics averaged across all 5 horizons & 5 CV folds)                               
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃      ┃                         ┃                ┃             ┃               ┃              ┃ Avg CV MAPE (%)  ┃
┃ Rank ┃ Model                   ┃  Model Family  ┃ Avg CV R² ↑ ┃ Avg CV RMSE ↓ ┃ Avg CV MAE ↓ ┃        ↓         ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│  #1  │ RidgeCV                 │     Linear     │   0.4219    │    0.9570     │    0.7514    │      2.68%       │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│  #2  │ LinearRegression        │     Linear     │   0.4185    │    0.9595     │    0.7523    │      2.68%       │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│  #3  │ SimpleAveragingEnsemble │ Ensemble (OOB) │   0.4175    │    0.9587     │    0.7540    │      2.69%       │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│  #4  │ ElasticNetCV            │     Linear     │   0.4088    │    0.9673     │    0.7596    │      2.71%       │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│  #5  │ ElasticNet_CatBoost_wit │ Ensemble (OOB) │   0.3839    │    0.9819     │    0.7741    │      2.76%       │
│      │ h_ElasticNetCV          │                │             │               │              │                  │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│  #6  │ ElasticNetCV_LGBM_with_ │ Ensemble (OOB) │   0.3832    │    0.9819     │    0.7730    │      2.76%       │
│      │ RidgeCV                 │                │             │               │              │                  │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│  #7  │ ElasticNet_CatBoost_wit │ Ensemble (OOB) │   0.3814    │    0.9836     │    0.7749    │      2.76%       │
│      │ h_RidgeCV               │                │             │               │              │                  │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│  #8  │ RidgeCV_CatBoost_with_E │ Ensemble (OOB) │   0.3650    │    0.9931     │    0.7846    │      2.80%       │
│      │ lasticNeCV              │                │             │               │              │                  │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│  #9  │ RidgeCV_CatBoost_with_R │ Ensemble (OOB) │   0.3627    │    0.9940     │    0.7853    │      2.80%       │
│      │ idgeCV                  │                │             │               │              │                  │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│ #10  │ CatBoostRegressor       │   Tree-based   │   0.3248    │    1.0258     │    0.8122    │      2.89%       │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│ #11  │ LGBMRegressor           │   Tree-based   │   0.2901    │    1.0485     │    0.8319    │      2.97%       │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│ #12  │ XGBRegressor            │   Tree-based   │   0.2370    │    1.0894     │    0.8629    │      3.07%       │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│ #13  │ Baseline (Persistence)  │    Baseline    │   0.1575    │    1.1479     │    0.8834    │      3.15%       │
└──────┴─────────────────────────┴────────────────┴─────────────┴───────────────┴──────────────┴──────────────────┘
  • Linear models still dominate, but ensembles close the gap:

    • RidgeCV remains the single best-performing model on average. However, the SimpleAveragingEnsemble has proven to be highly competitive, placing #3 on the leaderboard with a score nearly identical to the top linear models.
    • This suggests combining models can be a powerful strategy, as the simple ensemble significantly outperforms its weakest member (CatBoostRegressor).
  • Stacking ensembles show potential:

    • The more complex stacking ensembles (using a meta-learner) consistently outperform the base tree models, demonstrating that a learned combination is superior to a single tree model.
    • However, their performance still lags behind the simple linear models. This suggests that with untuned base models, the meta-learner struggles to extract consistently superior signals, and its added complexity does not yet pay off.

We will investigate the per-horizon, per-fold performance of three representative models: RidgeCV, SimpleAveragingEnsemble, and ElasticNet_CatBoost_with_ElasticNetCV.

In [30]:
df_fold_data = eval.generate_finalist_fold_data(
    results_files=results_files_to_load,
    finalists=["RidgeCV", "SimpleAveragingEnsemble", "ElasticNet_CatBoost_with_ElasticNetCV"],
    horizons_to_plot=["t+1", "t+2", "t+3", "t+4", "t+5"],
)

fig_fold_plot = eval.display_finalist_fold_plot(df_fold_data)
fig_fold_plot.show()
  • Ensembles excel at short-term forecasts: As hypothesized, both the simple and stacking ensembles clearly outperform the standalone RidgeCV model on the t+1 horizon, particularly in the later, more recent cross-validation folds (fold 3 to 5). This confirms that the non-linear CatBoost model is providing valuable, unique information for short-range predictions that the linear model cannot capture on its own.

  • Linear stability wins in the long-term:

    • From horizon t+2 onwards, the performance of the stacking ensemble (ElasticNet_CatBoost_with_ElasticNetCV) degrades significantly.
    • The SimpleAveragingEnsemble remains competitive, matching RidgeCV at t+2 but lagging slightly behind from t+3 to t+5, consistently edged out by the stability of the standalone linear model (although the gap between them does not grow with the horizon).
  • The path forward - Unleashing potential with tuning:

    • The results strongly suggest that the weak link in our stacking ensembles is the underperformance of the untuned tree-based models, especially on longer horizons.
    • We hypothesize that by using hyperparameter optimization (tuning) on the tree-based components of our ensembles, we can significantly boost their long-range performance, potentially allowing the stacking architecture to surpass the simple linear models and become the better model.

$\implies$ Finalists for the tuning stage: Based on our findings so far, we will advance all four architectures to the hyperparameter optimization stage, choosing only 1-2 representatives from each:

  1. RidgeCV (the single-model benchmark), and ElasticNetCV (close in performance, and it appears in the top stacking candidates)
  2. CatBoostRegressor (the best single tree-model), along with LGBMRegressor (due to the speed and potential gains in tuning)
  3. SimpleAveragingEnsemble (the best, "simple" ensemble)
  4. ElasticNet_CatBoost_with_ElasticNetCV (the best stacking candidate), and ElasticNetCV_LGBM_with_RidgeCV (for diversity, speed, and the potential gains of a tuned LGBM)

Q5.5. Hyperparameter Optimization

Q5.5. Hyperparameter Optimization

Our bake-off revealed that with default settings, linear models outperformed tree-based models. However, gradient boosting models' performance is often improved through careful hyperparameter optimization (HPO).

We will use Optuna, a state-of-the-art HPO framework, with the following setup:

  • Objective: weighted cross-validation score

    • Not simply to maximize the average R² score
    • As noted in EDA, and from the above CV folds analysis, we know recent data is more predictive - as weather is non-stationary. Therefore, we define the objective function to maximize a weighted average of the TimeSeriesSplit R² scores, using weights [0.10, 0.15, 0.20, 0.25, 0.30] for folds 1 through 5, respectively.
    • It tells the optimizer to prioritize models that perform well on the most recent data.
    • The weight differences are not huge, and they make sense:
      • Even though fold 1 is usually an anomaly (due to the smallest training size and sensitivity to concept drift), its weight is not so low that the optimizer ignores it, which would defeat the purpose of CV.
      • Likewise, fold 5's weight is only slightly higher; since it most closely represents the model's performance when given enough data to train on, it deserves that small extra emphasis.
  • Efficient search:

    • Persistent storage: All study results are saved to SQLite database (.db) files, allowing us to resume long-running jobs and to use optuna-dashboard to monitor the tuning process, which proved invaluable in iteratively refining our decisions.
    • Intelligent sampling (TPESampler): An intelligent algorithm that learns from past trials to guide its search towards more promising hyperparameter regions. In simple words: the trials get better and better.
    • Aggressive pruning (SuccessiveHalvingPruner): Terminates unpromising trials early. We set min_resource=3, meaning a trial must prove its worth on a weighted average of the first three CV folds before it is allowed to continue. If, after 3 folds, a trial's performance is in the bottom fraction of trials at that stage, the pruner terminates it, freeing up resources for new, potentially better trials.
    • Iteratively refined search spaces: For each model we want to tune:
      • First, we run some short trials with wide search space
      • We then analyze these trials (optuna-dashboard):
        • Most importantly, we look at the hyperparameter importances (both the fANOVA and PED-ANOVA evaluators, with much more emphasis on the former) and focus on the top few HPs
        • For each of those HPs, we look at the bivariate slice plot between our objective value (weighted CV $R^2$ score) and that HP's values across every trial run
        • Where we see a value range of the HP that scores higher than the rest, we narrow the search space down to that range, usually with a bit of buffer at the two boundaries.
        • After narrowing the range for the first horizon, we repeat with the remaining 4 horizons. We can then rerun the tuning for that model and confidently increase the number of trials over a much more efficient, focused search space, helping us cut down time and increase the potential gains instead of just casting a wide net and running a humongous number of trials:)
  • Each per-horizon tuning study was run 2-5 times independently, improving the odds of finding a global optimum instead of a local one.

Q5.5.1. Tuning Single Tree Models

Q5.5.1. Tuning Single Tree Models

First, we tune our best-performing tree models (CatBoostRegressor and LGBMRegressor) as standalone candidates, trying to find the optimal set of hyperparameters for each of the five forecast horizons individually, using the weighted CV objective.

This shows the maximum performance ceiling for a standalone tree model.

In [31]:
results_files_to_load = [
    "./assets/saved_results/bakeoff_results_tree.pkl",
    "./assets/saved_results/evaluation_results_tuned_single_models_v2_weighted_CV.pkl",
]

df_leaderboard_final = eval.generate_leaderboard_data(results_files=results_files_to_load)
eval.display_leaderboard(df_leaderboard_final, title="Optuna on Single Models")
                                             Optuna on Single Models                                              
                              (Metrics averaged across all 5 horizons & 5 CV folds)                               
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Rank ┃ Model                   ┃ Model Family ┃ Avg CV R² ↑ ┃ Avg CV RMSE ↓ ┃ Avg CV MAE ↓ ┃ Avg CV MAPE (%) ↓ ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│  #1  │ Tuned_LGBM_v2_weighted_ │  Tuned tree  │   0.4036    │    0.9674     │    0.7625    │       2.72%       │
│      │ CV                      │              │             │               │              │                   │
├──────┼─────────────────────────┼──────────────┼─────────────┼───────────────┼──────────────┼───────────────────┤
│  #2  │ Tuned_CatBoost_v2_weigh │  Tuned tree  │   0.3835    │    0.9822     │    0.7761    │       2.77%       │
│      │ ted_CV                  │              │             │               │              │                   │
├──────┼─────────────────────────┼──────────────┼─────────────┼───────────────┼──────────────┼───────────────────┤
│  #3  │ CatBoostRegressor       │  Tree-based  │   0.3248    │    1.0258     │    0.8122    │       2.89%       │
├──────┼─────────────────────────┼──────────────┼─────────────┼───────────────┼──────────────┼───────────────────┤
│  #4  │ LGBMRegressor           │  Tree-based  │   0.2901    │    1.0485     │    0.8319    │       2.97%       │
├──────┼─────────────────────────┼──────────────┼─────────────┼───────────────┼──────────────┼───────────────────┤
│  #5  │ XGBRegressor            │  Tree-based  │   0.2370    │    1.0894     │    0.8629    │       3.07%       │
└──────┴─────────────────────────┴──────────────┴─────────────┴───────────────┴──────────────┴───────────────────┘
  • Significant performance lift: Dramatic impact for the single tree models - Tuned_LGBM_v2 and Tuned_CatBoost_v2 show a massive improvement over their untuned (OOB) counterparts, with their average CV R² scores jumping from ~0.29-0.32 to a much more competitive ~0.38-0.40.

  • During the tuning process, we noticed a few things (we used optuna-dashboard to monitor the trials in real-time with their hyperparameters values and importances):

    • Across both models and all horizons, the tuning process consistently favored a "many weak learners" strategy to avoid overfitting
    • Low learning_rate: Forces the model to make small, incremental improvements (both models favored eta below 0.02).
    • Moderate-high n_estimators/iterations: A large number of trees are needed to converge with a low learning rate (around 300-1200, and the number decreases as the horizon increases).
    • Shallow max_depth: Simple, shallow trees are used as the building blocks, preventing the model from learning spurious noise in the data (most have max depth of 3-5).
    • Horizon-specific adaptation: Longer horizons (e.g., t+5), where the signal is weaker, consistently favored even simpler trees and more aggressive regularization than their short-horizon counterparts.
  • You might've noticed the _v2. Why v2? Where's v1? Is there a v3?? We also tried tuning these same models using the plain average CV $R^2$ score across all 5 folds instead of the weighted one. Those simple-mean runs are the _v1s, and looking at the difference between them:

In [32]:
results_files_to_load = [
    "./assets/saved_results/evaluation_results_tuned_single_models_v1_mean.pkl",
    "./assets/saved_results/evaluation_results_tuned_single_models_v2_weighted_CV.pkl",
    "./assets/saved_results/evaluation_results_tuned_LGBM_v3_linear_tree.pkl"
]

df_leaderboard_final = eval.generate_leaderboard_data(results_files=results_files_to_load)
eval.display_leaderboard(df_leaderboard_final, title="Average CV vs. Weighted CV optimization objective")
                                Average CV vs. Weighted CV optimization objective                                 
                              (Metrics averaged across all 5 horizons & 5 CV folds)                               
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Rank ┃ Model                   ┃ Model Family ┃ Avg CV R² ↑ ┃ Avg CV RMSE ↓ ┃ Avg CV MAE ↓ ┃ Avg CV MAPE (%) ↓ ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│  #1  │ Tuned_LGBM_v2_weighted_ │  Tuned tree  │   0.4036    │    0.9674     │    0.7625    │       2.72%       │
│      │ CV                      │              │             │               │              │                   │
├──────┼─────────────────────────┼──────────────┼─────────────┼───────────────┼──────────────┼───────────────────┤
│  #2  │ Tuned_LGBM_v3_linear_tr │  Tuned tree  │   0.3938    │    0.9773     │    0.7674    │       2.74%       │
│      │ ee                      │              │             │               │              │                   │
├──────┼─────────────────────────┼──────────────┼─────────────┼───────────────┼──────────────┼───────────────────┤
│  #3  │ Tuned_LGBM_v1_mean      │  Tuned tree  │   0.3924    │    0.9749     │    0.7693    │       2.75%       │
├──────┼─────────────────────────┼──────────────┼─────────────┼───────────────┼──────────────┼───────────────────┤
│  #4  │ Tuned_CatBoost_v2_weigh │  Tuned tree  │   0.3835    │    0.9822     │    0.7761    │       2.77%       │
│      │ ted_CV                  │              │             │               │              │                   │
├──────┼─────────────────────────┼──────────────┼─────────────┼───────────────┼──────────────┼───────────────────┤
│  #5  │ Tuned_CatBoost_v1_mean  │  Tuned tree  │   0.3747    │    0.9901     │    0.7827    │       2.79%       │
└──────┴─────────────────────────┴──────────────┴─────────────┴───────────────┴──────────────┴───────────────────┘
  • Compared to their _v2 counterparts, the _v1s lag behind by almost 0.01 in $R^2$ score for both models, even though the differences in the other metrics are less pronounced

  • LGBM, even with the default average-maximizing objective, still outperforms CatBoost tuned with weighted CV, suggesting that CatBoost's potential gains on this dataset, with our features, are lower than LGBM's.

  • Interestingly, LGBMRegressor also has a boolean parameter of linear_tree, where if set to True, it basically becomes a piecewise linear model:

    • It turns the leaf nodes of the trees from a constant value (prediction) into a simple linear regression model (for the training data/samples that fall into that leaf)
    • Since we know tree models can't extrapolate but linear models can, this should, *theoretically*, give our model the best of both worlds, increasing its performance especially on later horizons (an advantage of linear models over tree-based ones, as seen above). So we tried tuning one, designated _v3 in the leaderboard above (also with weighted CV, so it's a direct comparison to _v2).
    • But nah:) compared to the version with linear_tree=False (the default), it actually performs worse on every metric (averaged across 5 horizons on the CV train set)
    • Looking more deeply, however, it shows very slight improvements for horizons t+3 to t+5, mostly on fold 1 of those later horizons (where we know linear models perform much better); later folds show no statistically significant difference at all. And it fails outright on t+1, the horizon where the relationships are most complex and thus non-linear models usually beat linear ones; while this is still LGBM (a non-linear model), using linear regressors as leaf nodes has cost it that inherent strength.
    • All in all, a great idea on paper (tree models with the power of linear models integrated? Sweet!), but *inefficacious* in practice (at least for our case).
In [33]:
results_files_to_load = [
    "./assets/saved_results/bakeoff_results_linear.pkl",
    "./assets/saved_results/bakeoff_results_tree.pkl",
    "./assets/saved_results/evaluation_results_tuned_single_models_v2_weighted_CV.pkl",
    "./assets/saved_results/bakeoff_results_ensemble.pkl",
]

df_fold_data = eval.generate_finalist_fold_data(
    results_files=results_files_to_load,
    finalists=["RidgeCV", "SimpleAveragingEnsemble", "CatBoostRegressor", "Tuned_LGBM_v2_weighted_CV"],
    horizons_to_plot=["t+1", "t+5"],
)

fig_fold_plot = eval.display_finalist_fold_plot(df_fold_data)
fig_fold_plot.show()

We chose:

  • CatBoostRegressor (dark blue) as the best untuned tree model
  • Tuned_LGBM_v2_weighted_CV as the best tuned tree model (red)
  • RidgeCV as the current top performer (light blue)
  • SimpleAveragingEnsemble as the top ensemble (yellow)

$\implies$ Exciting improvements for the tuned tree models:

  • For the 1- and 2-day-ahead forecasts, all models are close to each other, and RidgeCV actually comes last in folds 4 and 5 of t+1, even compared to the untuned CatBoost, showing that non-linear models are better at capturing the complex t+1 signals when given enough data.
  • For later horizons, RidgeCV still comes out on top, but the tuned LGBM and the simple ensemble follow very closely, a drastic boost over the untuned CatBoost.

Now, we will try tuning the ensembles, and see if it can really unlock the potential to surpass RidgeCV and crown a new champion model.

Q5.5.2. Tuning Ensembles

Q5.5.2. Tuning Ensembles

In [34]:
results_files_to_load = [
    "./assets/saved_results/bakeoff_results_linear.pkl",
    "./assets/saved_results/bakeoff_results_ensemble.pkl",
    "./assets/saved_results/evaluation_results_tuned_ensembles_test.pkl",
]

df_leaderboard_final = eval.generate_leaderboard_data(results_files=results_files_to_load)
eval.display_leaderboard(
    df_leaderboard_final,
    title="Ensembles - OOB vs Optuna",
    highlight_models=["Avg_Ridge_TunedLGBM", "SimpleAveragingEnsemble", "Stack_ElasticNet_LGBM_Ridge_Tuned", "RidgeCV"],
    exclude_models=["LinearRegression", "ElasticNetCV"],
)
                                             Ensembles - OOB vs Optuna                                             
                               (Metrics averaged across all 5 horizons & 5 CV folds)                               
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃      ┃                         ┃                 ┃             ┃               ┃              ┃ Avg CV MAPE (%) ┃
┃ Rank ┃ Model                   ┃  Model Family   ┃ Avg CV R² ↑ ┃ Avg CV RMSE ↓ ┃ Avg CV MAE ↓ ┃        ↓        ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│  #1  │ Avg_Ridge_TunedLGBM     │  Tuned simple   │   0.4371    │    0.9432     │    0.7402    │      2.64%      │
│      │                         │       avg       │             │               │              │                 │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #2  │ Avg_Ridge_TunedCatBoost │  Tuned simple   │   0.4331    │    0.9463     │    0.7437    │      2.65%      │
│      │                         │       avg       │             │               │              │                 │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #3  │ Avg_ElasticNet_TunedLGB │  Tuned simple   │   0.4287    │    0.9498     │    0.7468    │      2.66%      │
│      │ M                       │       avg       │             │               │              │                 │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #4  │ Avg_ElasticNet_TunedCat │  Tuned simple   │   0.4224    │    0.9546     │    0.7515    │      2.68%      │
│      │ Boost                   │       avg       │             │               │              │                 │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #5  │ RidgeCV                 │     Linear      │   0.4219    │    0.9570     │    0.7514    │      2.68%      │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #7  │ SimpleAveragingEnsemble │ Ensemble (OOB)  │   0.4175    │    0.9587     │    0.7540    │      2.69%      │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #9  │ Stack_ElasticNet_LGBM_R │ Tuned stacking  │   0.4054    │    0.9663     │    0.7613    │      2.71%      │
│      │ idge_Tuned              │                 │             │               │              │                 │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│ #10  │ Stack_ElasticNet_CatBoo │ Tuned stacking  │   0.3926    │    0.9757     │    0.7688    │      2.74%      │
│      │ st_Ridge_Tuned          │                 │             │               │              │                 │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│ #11  │ ElasticNet_CatBoost_wit │ Ensemble (OOB)  │   0.3839    │    0.9819     │    0.7741    │      2.76%      │
│      │ h_ElasticNetCV          │                 │             │               │              │                 │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│ #12  │ ElasticNetCV_LGBM_with_ │ Ensemble (OOB)  │   0.3832    │    0.9819     │    0.7730    │      2.76%      │
│      │ RidgeCV                 │                 │             │               │              │                 │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│ #13  │ ElasticNet_CatBoost_wit │ Ensemble (OOB)  │   0.3814    │    0.9836     │    0.7749    │      2.76%      │
│      │ h_RidgeCV               │                 │             │               │              │                 │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│ #14  │ RidgeCV_CatBoost_with_E │ Ensemble (OOB)  │   0.3650    │    0.9931     │    0.7846    │      2.80%      │
│      │ lasticNeCV              │                 │             │               │              │                 │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│ #15  │ RidgeCV_CatBoost_with_R │ Ensemble (OOB)  │   0.3627    │    0.9940     │    0.7853    │      2.80%      │
│      │ idgeCV                  │                 │             │               │              │                 │
└──────┴─────────────────────────┴─────────────────┴─────────────┴───────────────┴──────────────┴─────────────────┘
  • Simple averaging ensembles are the clear best performers:

    • All four tuned simple-averaging ensembles occupy the top 4 spots on the leaderboard; even the "weakest" one surpasses the best single linear model, the previous top performer.
    • The new best performer is Avg_Ridge_TunedLGBM, achieving the highest average CV R² of 0.4371:
      • Formed by combining the best linear model (RidgeCV) with the best tuned tree model (LGBM)
      • It combines the stability and extrapolation ability of a linear model with a powerful, finely tuned tree model that can capture more complex relationships, creating a model that is better than either base model alone.
      • However, the combination of RidgeCV and tuned CatBoost is a very close second, even though the single tuned CatBoost scored substantially lower than tuned LGBM, suggesting that the errors of Ridge and CatBoost cancel each other out nicely. Fine-tuning CatBoost further could potentially yield an even better top model.
  • Again, tuning provides a substantial performance lift:

    • The tuned simple averaging ensembles improve greatly over their OOB counterparts. For example, the SimpleAveragingEnsemble (a combination of Ridge and base CatBoost) improves from an $R^2$ of 0.4175 to 0.4331, a 3.74% relative improvement. All other metrics also improve, by roughly 1.30% to 1.50% compared to OOB.
    • The tuned stacking ensembles also improved over their OOB states, although not as drastically as the tuned simple averaging ones.
  • Stacking ensembles - Not worth their complexity for our case:

    • While the holistically tuned stacking ensembles improved, they still couldn't match the performance of the simpler averaging method.
    • Perhaps the data we feed to the meta-learner isn't sufficient for it to reliably learn better per-model weights than a simple average, or there is no clear trend to learn, making the meta-learner more prone to overfitting.

$\implies$ So, what if we try assigning static weights specific for each horizon?

  • Since tree models perform better than linear models at t+1, while at later horizons (t+3 to t+5) linear models take the lead, we could assign higher weights to the tree model for the first one or two days and shift weight toward the linear model for later horizons.
  • For each forecast horizon, we performed a weights grid search during CV (from 0.0 to 1.0 in 0.05 increments) to find the optimal blend between the linear and tree model, averaging these best weights across all 5 folds.
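The per-horizon grid search described above can be sketched as follows. This is a minimal illustration with hypothetical arrays (the project's actual logic lives in the `utils` modules); `best_blend_weight` and its inputs are assumptions for the sketch.

```python
import numpy as np
from sklearn.metrics import r2_score

def best_blend_weight(y_true, linear_preds, tree_preds, step=0.05):
    """Grid-search the linear-model weight w in [0, 1] that maximizes the R²
    of the blend w * linear + (1 - w) * tree on one fold's validation data."""
    weights = np.linspace(0.0, 1.0, int(round(1.0 / step)) + 1)
    scores = [
        r2_score(y_true, w * linear_preds + (1.0 - w) * tree_preds)
        for w in weights
    ]
    return float(weights[int(np.argmax(scores))])
```

In the actual pipeline this search would run once per (horizon, fold) pair, and the winning weights would then be averaged across the 5 CV folds for each horizon.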
In [35]:
results_files_to_load = [
    "./assets/saved_results/evaluation_results_tuned_ensembles_test.pkl",
    "./assets/saved_results/evaluation_results_tuned_ensemble_weighted_avg.pkl",
]

df_leaderboard_final = eval.generate_leaderboard_data(results_files=results_files_to_load)
eval.display_leaderboard(
    df_leaderboard_final,
    title="Simple 50/50 vs Static Weighted-horizon Ensembles",
    exclude_models=[
        "Avg_Ridge_TunedCatBoost",
        "Avg_ElasticNet_TunedLGBM",
        "Avg_ElasticNet_TunedCatBoost",
        "Stack_ElasticNet_LGBM_Ridge_Tuned",
        "Stack_ElasticNet_CatBoost_Ridge_Tuned",
    ],
)
                                 Simple 50/50 vs Static Weighted-horizon Ensembles                                 
                               (Metrics averaged across all 5 horizons & 5 CV folds)                               
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃      ┃                         ┃                 ┃             ┃               ┃              ┃ Avg CV MAPE (%) ┃
┃ Rank ┃ Model                   ┃  Model Family   ┃ Avg CV R² ↑ ┃ Avg CV RMSE ↓ ┃ Avg CV MAE ↓ ┃        ↓        ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│  #1  │ WeightedAvg_Ridge_Tuned │  Tuned simple   │   0.4416    │    0.9398     │    0.7405    │      2.64%      │
│      │ LGBM                    │       avg       │             │               │              │                 │
├──────┼─────────────────────────┼─────────────────┼─────────────┼───────────────┼──────────────┼─────────────────┤
│  #2  │ Avg_Ridge_TunedLGBM     │  Tuned simple   │   0.4371    │    0.9432     │    0.7402    │      2.64%      │
│      │                         │       avg       │             │               │              │                 │
└──────┴─────────────────────────┴─────────────────┴─────────────┴───────────────┴──────────────┴─────────────────┘
    Optimal Blending Weight for Linear Model (RidgeCV)     
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Horizon                           ┃ Avg. Optimal Weight ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ (WeightedAvg_Ridge_TunedLGBM) t+1 │        0.40         │
│ (WeightedAvg_Ridge_TunedLGBM) t+2 │        0.39         │
│ (WeightedAvg_Ridge_TunedLGBM) t+3 │        0.53         │
│ (WeightedAvg_Ridge_TunedLGBM) t+4 │        0.50         │
│ (WeightedAvg_Ridge_TunedLGBM) t+5 │        0.58         │
└───────────────────────────────────┴─────────────────────┘
In [36]:
# df_fold_data = eval.generate_finalist_fold_data(
#     results_files=results_files_to_load,
#     finalists=["WeightedAvg_Ridge_TunedLGBM", "Avg_Ridge_TunedLGBM"],
#     horizons_to_plot=["t+1", "t+2", "t+3", "t+4", "t+5"],
# )

# fig_fold_plot = eval.display_finalist_fold_plot(df_fold_data)
# fig_fold_plot.show()
  • The new WeightedAvg_Ridge_TunedLGBM performs better, achieving the highest average CV R² of 0.4416 and the lowest RMSE of 0.9398.
  • The optimal weights confirm our hypothesis: the ensemble favors the tree model for short horizons (t+1, t+2: ~0.40 weight on the linear model) but shifts reliance to the more stable linear model for longer horizons (t+5: 0.58 linear).
  • While R² and RMSE improved, MAE slightly degraded (0.7402 $\rightarrow$ 0.7405), indicating the weighted model is better at avoiding large errors (which RMSE penalizes heavily) but is ***(very)* marginally less precise on average**.
  • Upon closer inspection, most of the $R^2$ gains come from fold 1 on horizons t+3 to t+5, thanks to reduced reliance on fold 1's inferior LGBM at later horizons. While those fold 1 gains likely wouldn't translate into an improvement in production (or on the test set), they do mean our model is more robust in more extreme cases.

$\implies$ Conclusion:

  • With that, our top performer is the WeightedAvg_Ridge_TunedLGBM model, a "simple" horizon-weighted average of a linear model (RidgeCV) and a tuned tree model (LGBM).
  • However, RidgeCV remains a strong second choice: it is a single linear model that requires no tuning or complex logic, trains very fast, and is far more interpretable (a single linear model operating on a set of only 41 features). It is especially attractive for applications where absolute accuracy matters less than speed, simplicity, and interpretability.

Q5.5.3. Hyperparameter Importance Analysis

Q5.5.3. Hyperparameter Importance Analysis

In [37]:
df_importance_lgbm = joblib.load("./assets/saved_results/df_importance_lgbm.pkl")

fig_bump_chart = eval.plot_importance_bump_chart(df_importance_lgbm)
fig_bump_chart.show()
  • The chart confirms that the model intelligently adapts its priorities based on the forecast horizon, a direct response to the decreasing signal-to-noise ratio.

  • For short-term forecasts (t+1):

    • The model focuses on complexity and capacity, more specifically, favoring many weak learners over a few complex, deep ones.
    • n_estimators (number of trees) and num_leaves (complexity of each tree) are the most critical parameters. Their values are also higher, almost double that of horizon t+4, t+5.
    • The primary challenge is fitting the strong, clear signal accurately.
  • For long-term forecasts (t+4, t+5):

    • The model's priorities completely invert to focus on regularization and simplicity, most notably via a low max_depth and regularization to prevent overfitting given the low signal-to-noise ratio.
    • max_depth becomes, by far, the most dominant hyperparameter, responsible for 37% of the performance variance. Optuna repeatedly scores highest with a max_depth of 2 from horizon t+3 onwards, compared to 5 for the first horizon.
    • Parameters like learning_rate and lambda_l1 also surge in importance, with the optimal learning_rate almost double that of horizons t+1 to t+3 (~0.01 compared to ~0.005).
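To make the bump chart's construction concrete, here is a minimal sketch (with hypothetical importance values, not the actual `df_importance_lgbm` contents) of turning per-horizon hyperparameter importances into the per-horizon ranks the chart plots:

```python
import pandas as pd

# Hypothetical per-horizon importance fractions, mimicking the chart's input.
df_importance = pd.DataFrame(
    {
        "t+1": {"n_estimators": 0.30, "num_leaves": 0.25, "max_depth": 0.10},
        "t+5": {"n_estimators": 0.08, "num_leaves": 0.07, "max_depth": 0.37},
    }
)

# Rank 1 = most important hyperparameter within each horizon (column);
# the bump chart then draws a line per hyperparameter across horizons.
df_ranks = df_importance.rank(ascending=False, axis=0).astype(int)
```

A hyperparameter whose line climbs toward rank 1 at later horizons (like max_depth here) is exactly the priority inversion discussed above.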

Q5.6. Chosen Models Diagnostics

Q5.6. Chosen Models Diagnostics

Before proceeding to the final test set evaluation, we must perform a deep diagnostic analysis of our chosen models on the cross-validation data. This allows us to understand their failure modes, uncover systematic biases, and build confidence in their behavior.

Our two finalists are:

  • Champion: WeightedAvg_Ridge_TunedLGBM
  • Alternative: RidgeCV

We will perform all diagnostics comparatively to understand their relative strengths and weaknesses.

Q5.6.1. Residuals vs. Time

Q5.6.1. Residuals vs. Time

The first and most fundamental residual check is to plot the errors over time. A perfect model would produce residuals that are a random cloud of points centered on zero, with no discernible pattern. Any clear pattern indicates that the model is failing to capture some time-dependent information. We use a LOESS trendline to help visualize any low-frequency, systematic bias over time. The easiest way to check is to look at the 95% confidence interval of the LOESS trendline and see if it contains the y=0 line.
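The core of this check can be sketched with numpy alone. Here a simple moving average stands in for the LOESS smoother (an assumption for brevity; the actual plots use a true LOESS fit), and synthetic zero-mean residuals play the role of a well-behaved model's errors:

```python
import numpy as np

def smoothed_bias(residuals, window=30):
    """Moving-average smoother as a cheap stand-in for a LOESS trendline."""
    kernel = np.ones(window) / window
    return np.convolve(residuals, kernel, mode="valid")

rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 0.5, size=365)   # ideal model: zero-mean noise
trend = smoothed_bias(residuals)

# A well-behaved model keeps the smoothed bias inside a tight band around 0.
unbiased = bool(np.all(np.abs(trend) < 0.5))
```

Any sustained excursion of the smoothed trend away from zero flags a period of systematic over- or under-prediction, which is what the LOESS confidence intervals in the plot below make rigorous.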

In [38]:
diagnostic_data = joblib.load("./assets/saved_results/chosen_models_diagnostic_data.pkl")

# fig_resid_vs_time = diag.plot_residuals_vs_time(
#     diagnostic_data=diagnostic_data,
#     horizons_to_plot=["t+1", "t+5"],
# )
fig_resid_vs_time = pio.read_json("./assets/plots/fig_resid_vs_time.json")
fig_resid_vs_time.show()
  • Both models are largely unbiased: At a high level, the residuals for both models are tightly centered around the zero line. The LOESS trends generally stay within a narrow +/- 0.5°C band, indicating no major, runaway systematic errors. Zoom in on the plots to see the range of each LOESS line and its confidence interval more clearly.

  • Subtle, time-dependent biases are present: The 95% confidence intervals reveal periods of statistically significant bias where the shaded area does not overlap with the y=0 line.

  • t+1 horizon (Short-range): The simpler RidgeCV is marginally more stable:

    • The RidgeCV model (bottom left) exhibits a shorter period of significant bias (approx. late 2019 to late 2020), with its average error peaking at around +0.15°C.
    • The Champion Ensemble, while having a slightly better overall R² score, shows a longer period of bias (approx. mid-2018 to mid-2020) and its peak average error reaches a higher +0.20°C.
    • This suggests that for the t+1 forecast, the linear model's stability gives it a slight edge in long-term calibration.
  • t+5 horizon (Long-range): The ensemble demonstrates superior robustness:

    • The roles reverse for the more difficult 5-day forecast. The RidgeCV model (bottom right) shows multiple, prolonged periods of significant bias, with its average error swinging between approximately -0.3°C and +0.4°C.
    • The Champion Ensemble (top right), while also showing bias, is comparatively more stable. Its average error trend has a smaller overall range, and it has fewer extended periods where the 95% CI is entirely detached from the zero line.

$\implies$ Conclusion: This analysis provides crucial evidence for our champion selection. While RidgeCV is an incredibly strong and stable baseline for short-term forecasts, the WeightedAvg_Ridge_TunedLGBM ensemble demonstrates superior robustness and lower systematic bias on the more challenging long-range forecasts, justifying its added complexity.

Q5.6.2. Autocorrelation of Residuals (ACF/PACF)

Q5.6.2. Autocorrelation of Residuals (ACF/PACF)

This answers the question: "Are the model's errors predictable?" A well-specified model should produce residuals that are indistinguishable from white noise, meaning the error on one day should not predict the error on the next. We check this by plotting the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) of the residuals.

  • ACF: Measures the correlation of the residual series with its own lagged values.
  • PACF: Measures the correlation between a residual and a lagged value after removing the effect of all intervening lags.

If our models have successfully captured the temporal patterns in the data, we expect to see all correlation spikes (the colored dots) fall within the grey 95% confidence interval, especially for the first few lags. We focus here on the t+1 horizon, where temporal patterns are strongest.
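A minimal numpy-only sketch of the ACF computation and the white-noise band (the notebook's plots come from `utils.diagnostics`; this is just the underlying idea, with synthetic data):

```python
import numpy as np

def acf(x, nlags=10):
    """Sample autocorrelation of a series for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = float(np.dot(x, x))
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)]
    )

rng = np.random.default_rng(1)
residuals = rng.normal(size=500)          # stand-in for model residuals
rho = acf(residuals, nlags=10)

# Approximate 95% white-noise bounds: +/- 1.96 / sqrt(n).
band = 1.96 / np.sqrt(len(residuals))
significant_lags = np.flatnonzero(np.abs(rho[1:]) > band) + 1
```

Any lag landing in `significant_lags` would be a spike outside the grey confidence interval in the plots below.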

In [39]:
fig_acf_pacf = diag.plot_acf_pacf(
    diagnostic_data=diagnostic_data,
    horizons_to_plot=["t+1"],
)

fig_acf_pacf.show()
  • Excellent performance from both models: The residuals for both models are very close to white noise. Almost all autocorrelation spikes for lags greater than 1 fall well within the 95% confidence interval.

  • It shows that the extensive lag and rolling window features we created have successfully captured and explained the vast majority of the temporal dependencies in the data.

  • Minor remaining Lag 1 autocorrelation:

    • Both models exhibit a single, statistically significant spike at Lag 1 in both the ACF and PACF plots.
    • Indicates that there is a small but predictable relationship between the error on one day and the error on the previous day. For example, if the model overpredicted yesterday's temperature, it's slightly more likely to overpredict today's as well.
    • This is very common in time-series forecasting, suggesting a tiny amount of "error momentum" that the models did not capture. However, the magnitude of the correlation is small (around 0.1-0.15), meaning the practical impact of this uncaptured pattern is minimal.

$\implies$ Models don't suffer from major misspecification, their errors are almost entirely random and unpredictable. Good.

Q5.6.3. Residuals vs. Predicted Values & Key Features

Q5.6.3. Residuals vs. Predicted Values & Key Features

Next, we check for two other types of systematic bias:

  1. Bias related to the prediction magnitude: Does the model make different kinds of errors when predicting high temperatures versus low temperatures? A curved LOESS trend in a "Residuals vs. Predicted" plot would indicate such a bias (e.g., consistently under-predicting the peak of heatwaves).
  2. Conditional bias: Is the model's error dependent on the value of a specific input feature? For example, is it consistently wrong on days with zero solar radiation? We check this by plotting residuals against our most important predictors.
In [40]:
fig_resid_vs_pred = diag.plot_residuals_vs_variable(
    diagnostic_data=diagnostic_data,
    horizon_to_plot="t+1",
    x_variable="predicted",
)
fig_resid_vs_pred.show()
  • Largely unbiased performance: For the majority of predictions (between ~26.5°C and 30°C), the LOESS trendlines for both models are extremely close to the zero line, indicating no significant systematic bias in the most common temperature range.

  • Minor bias at the extremes:

    • High temperatures (Heatwaves): Both tend to under-predict for the highest temperatures (the trendline curves upwards for predicted values > 30°C) - a common model behavior, suggesting a conservative bias when forecasting extreme values.
    • Low temperatures: Subtle bias in opposite directions - ensemble tends to slightly over-predict the lowest temperatures (trendline below zero) while RidgeCV slightly under-predicts them.

$\implies$ While not perfectly flat, the trendlines are very close to zero, and the maximum average biases are well within a +/- 0.5°C range. This indicates the models are well-calibrated and not suffering from significant, prediction-dependent bias. The minor biases at the extremes are something we can try to improve later.

We can also check this type of systematic bias against individual input variables. We choose solarradiation, as it's the feature most correlated with our target variable temp that isn't itself a temperature measure (i.e., feelslike, tempmax, etc.).

In [41]:
fig_resid_vs_variable = diag.plot_residuals_vs_variable(
    diagnostic_data=diagnostic_data,
    horizon_to_plot="t+1",
    x_variable="solarradiation",
)
fig_resid_vs_variable.show()
  • No strong conditional bias: Again, the trendlines of both models stay very close to zero for the entire range of the solarradiation values.
  • Shows that the models' accuracy is not dependent on the amount of solar radiation. They are just as accurate on cloudy, low-radiation days as they are on bright, sunny days.
  • This pattern of having no strong conditional bias holds true for other key predictors as well (we checked with 8 other key features, just not shown here).

$\implies$ Confirms the models are not biased by the value of any single predictor, and we can be more confident about their robustness.

Q5.6.4. Model Belief vs. Actual Impact (SHAP vs. Permutation)

Q5.6.4. Model Belief vs. Actual Impact (SHAP vs. Permutation)

We try to answer the question: "Does what our model thinks is important internally align with what actually drives its performance in practice?" To do this, we compare two state-of-the-art feature importance techniques.

  • 1. Weighted SHAP (What the model thinks is important):

    • What it is: SHAP (SHapley Additive exPlanations) is a game-theory-based method that calculates the marginal contribution of each feature to every single prediction. We calculate the average absolute SHAP value for each feature from both the linear (RidgeCV) and tree (LGBM) components of our ensemble, then combine them using their respective horizon weights.
    • What it tells us: Represents the model's internal logic. It shows which features the model paid the most attention to when constructing its predictions, based on its learned coefficients and tree structures.
  • 2. Permutation importance (Actual performance impact):

    • What it is: Directly measures a feature's importance by randomly shuffling its values (so the feature at time t is no longer matched to the t+1 target) and observing the resulting drop in the model's performance (R² score).
    • What it tells us: Measures how much the model's real-world predictive accuracy actually depends on that feature. A large performance drop means the feature is critical.
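The permutation side of this comparison is easy to demonstrate on a toy regression with scikit-learn's `permutation_importance` (synthetic data; the notebook's actual computation runs on the full ensemble and feature sets):

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=300)   # only feature 0 carries signal

model = Ridge().fit(X, y)

# Shuffle each column in turn and record the resulting drop in R².
result = permutation_importance(
    model, X, y, scoring="r2", n_repeats=10, random_state=0
)
```

Shuffling the informative column collapses R², giving it a large importance score, while shuffling the noise columns barely moves the metric.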

How to interpret the plot?

The plot below is a 2x2 grid comparing these two methods for the chosen ensemble (WeightedAvg_Ridge_TunedLGBM) on the t+1 and t+5 horizons.

  • Rows: Represent the forecast horizon (t+1 vs. t+5).
  • Left plot (SHAP): Shows the model's internal belief.
  • Right plot (Permutation): Shows the actual performance impact.
  • Y-axis: The features are sorted by their SHAP importance (left column), and both plots use this same feature ordering so we can directly compare horizontally between the 2 plots.

A great, highly interpretable model would show strong alignment between the two columns. Significant discrepancies — where a feature has high SHAP value but low permutation importance, or vice versa — can reveal where the model's internal logic might be overly complex or reliant on non-critical features.

In [42]:
shap_vs_perm_data = joblib.load("assets/saved_results/shap_vs_perm_data.pkl")

fig_shap_vs_perm = diag.plot_shap_vs_perm_comparison(
    shap_vs_perm_data=shap_vs_perm_data,
    horizons_to_plot=["t+1", "t+5"],
    n_top_features=20,
    sort_by="shap",
)
fig_shap_vs_perm.show()
  • An adaptive model: the features our champion ensemble relies on change dramatically depending on the forecast horizon.

  • For the short-term (t+1) forecast:

    • The model acts as a "state-based" predictor. Both SHAP (belief) and Permutation (impact) identify feelslike as the single most dominant feature.
    • Belief vs. impact divergence: While the model "looks at" other temperature features like tempmin and tempmax (high SHAP scores), their actual performance impact is low. This is due to multicollinearity: their information is already captured by feelslike, which suggests the model is robust to redundant inputs.
    • temp_lag_1, while still in the top 20, ranks far below feelslike. This is likely because feelslike also captures other information (humidity, dew point, etc.) and is a more stable predictor of tomorrow's temperature than yesterday's raw sensor-measured temperature.
  • For the long-term (t+5) forecast:

    • The model completely shifts its strategy, now more of a "context-based" predictor.
    • The top of both charts are now dominated by the engineered temporal and rolling features, but more notably, the longer windows: windspeed_roll_14d_mean, sin_day_of_year, and temp_roll_28d_mean.
    • Validates the feature engineering work, proving that to predict far into the future, the model correctly learns to ignore noisy, short-term data and rely on stable, long-term seasonal and trend signals.
  • On the engineered features:

    • High-value features: The rolling window and cyclical/Fourier features are undeniably the most valuable predictors for long-range forecasting.
    • Lower-value features: Our text-based (SVD) and domain-specific (daylight_hours, temp_range) features consistently rank low in importance for this model. Their signal appears to be weaker, or redundant compared to the primary temporal and atmospheric features.

$\implies$ Conclusion: The model correctly identifies the most powerful predictors for both the short- and long-range tasks. The alignment between its internal belief (SHAP) and its actual performance impact (Permutation) on the most critical features is very strong. The model is interpretable and, more importantly, for the most critical features we can trust that what it believes is important really is.

Q5.7. Ensemble Model Enhancement Experiments

Q5.7. Ensemble Model Enhancement Experiments

As seen from previous analysis, we have revealed minor, systematic biases in the chosen champion model (e.g., under-predicting extreme heat, small Lag 1 error autocorrelation). In this final experimental step, we test several enhancement strategies to see if we can correct these biases and achieve a final performance boost.

All enhancements are tested using rigorous, weighted CV on the training set. We ensured that the logic for these enhancements does not leak information from the validation fold back into the training process. We achieve this with a nested, pipeline-based approach within each TimeSeriesSplit fold:

  • Isotonic calibration: For each CV fold, the training data is further split into a sub-training set and a calibration set. The base ensemble is trained on the sub-training data, and the calibrator then learns its correction mapping on the held-out calibration data. This ensures the calibration rule is learned without seeing the final validation data for that fold.

  • Residual stacking: The primary model is trained on the training portion of the fold. Its errors (residuals) on that same training data are then used as the target for a secondary "error model." This two-stage pipeline is then used to make a final prediction on the unseen validation fold. At no point does the error model train on the validation set's residuals.

We will compare the performance of these enhanced models against our unenhanced champion.
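For the isotonic calibration path, the leakage-free nesting inside a single fold can be sketched as follows (synthetic data and a plain Ridge standing in for the ensemble; all names here are illustrative, not the project's actual `utils` API). Residual stacking uses the same nesting, just with an error model in place of the calibrator.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] + 0.2 * rng.normal(size=400)

# Chronological split of the fold's training slice: the base model
# never sees the calibration slice, and neither piece sees validation data.
X_sub, X_cal = X[:300], X[300:]
y_sub, y_cal = y[:300], y[300:]

base = Ridge().fit(X_sub, y_sub)

# The calibrator learns a monotone correction on held-out calibration data.
calibrator = IsotonicRegression(out_of_bounds="clip").fit(
    base.predict(X_cal), y_cal
)

def predict_calibrated(X_new):
    # Validation-time path: raw base prediction -> calibrated prediction.
    return calibrator.predict(base.predict(X_new))
```

The validation fold only ever flows through `predict_calibrated`, so neither stage of the pipeline trains on it.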

In [43]:
df_enhancement_leaderboard = eval.generate_leaderboard_data(
    results_files=["./assets/saved_results/evaluation_results_ensemble_enhancements.pkl"]
)
eval.display_leaderboard(
    df_leaderboard=df_enhancement_leaderboard,
    title="Champion Ensemble: Base vs Enhanced",
)
                                        Champion Ensemble: Base vs Enhanced                                        
                               (Metrics averaged across all 5 horizons & 5 CV folds)                               
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃      ┃                         ┃                ┃             ┃               ┃              ┃ Avg CV MAPE (%)  ┃
┃ Rank ┃ Model                   ┃  Model Family  ┃ Avg CV R² ↑ ┃ Avg CV RMSE ↓ ┃ Avg CV MAE ↓ ┃        ↓         ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│  #1  │ Champion (Unenhanced)   │    Unknown     │   0.4416    │    0.9398     │    0.7405    │      2.64%       │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│  #2  │ Ensemble + Isotonic     │ Ensemble (OOB) │   0.4365    │    0.9438     │    0.7406    │      2.64%       │
│      │ Calibration             │                │             │               │              │                  │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│  #3  │ Ensemble + Residual     │ Tuned stacking │   0.4250    │    0.9524     │    0.7471    │      2.66%       │
│      │ Stacking (Tuned LGBM)   │                │             │               │              │                  │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│  #4  │ Ensemble + Residual     │ Tuned stacking │   0.4249    │    0.9525     │    0.7472    │      2.66%       │
│      │ Stacking (Tuned LGBM) + │                │             │               │              │                  │
│      │ Calibration             │                │             │               │              │                  │
├──────┼─────────────────────────┼────────────────┼─────────────┼───────────────┼──────────────┼──────────────────┤
│  #5  │ Ensemble + Residual     │ Ensemble (OOB) │   0.1229    │    1.1290     │    0.8985    │      3.21%       │
│      │ Stacking (Ridge)        │                │             │               │              │                  │
└──────┴─────────────────────────┴────────────────┴─────────────┴───────────────┴──────────────┴──────────────────┘
  • No improvement from enhancements: None of the enhancement strategies managed to improve upon the unenhanced one. In fact, all of them resulted in a slight to significant degradation in performance.

  • Isotonic calibration offers no benefit: The calibrated model's performance is statistically identical to the unenhanced version, indicating the minor biases we observed at the temperature extremes were not systematic enough across all cross-validation folds for the calibrator to learn a consistently useful correction.

  • Residual stacking degrades performance: Attempting to model the residuals made the model worse, showing that the residuals from the base ensemble are already close to random noise. The secondary "error model" was likely overfitting to the noise in the training folds, adding more variance than signal.

So, we have likely reached the performance ceiling for the chosen ensemble model.

We will now put these two models (WeightedAvg_Ridge_TunedLGBM and RidgeCV) to the test (set) and see how they fare:)

Q5.8. Final Evaluation on Test Set

Q5.8. Final Evaluation on Test Set

We train the 2 models on 100% of the training data, and evaluate their performance exactly once on the held-out test set.

The results represent the final, unbiased estimate of how these models would perform in a real-world, production scenario on unseen data.

In [44]:
final_models_to_evaluate = {
    "WeightedAvg_Ridge_TunedLGBM": {
        "type": "weighted_averaging_tuned",
    },
    "RidgeCV": {
        "type": "oob",
        "feature_set": "linear",
    },
}

final_predictions = train.load_and_predict_on_test(
    models_to_predict=final_models_to_evaluate,
    X_test_linear=X_test_linear,
    X_test_tree=X_test_tree,
    horizons=config.HORIZONS,
    load_dir="./assets/models/daily",
)

eval.display_final_performance_table(
    predictions=final_predictions,
    y_test=y_test,
    display_names={
        "WeightedAvg_Ridge_TunedLGBM": "Ensemble",
        "RidgeCV": "RidgeCV",
    },
)
                           Final Model Performance on Unseen Test Set                            
┏━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃         ┃ Ensemble ┃ RidgeCV ┃ Ensemble ┃ RidgeCV  ┃ Ensemble ┃ RidgeCV ┃ Ensemble ┃ RidgeCV  ┃
┃ Horizon ┃  (R² ↑)  ┃ (R² ↑)  ┃ (RMSE ↓) ┃ (RMSE ↓) ┃ (MAE ↓)  ┃ (MAE ↓) ┃ (MAPE ↓) ┃ (MAPE ↓) ┃
┡━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│   t+1   │  0.7460  │ 0.7396  │  0.7912  │  0.8010  │  0.6265  │ 0.6324  │  2.19%   │  2.21%   │
├─────────┼──────────┼─────────┼──────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│   t+2   │  0.6306  │ 0.6232  │  0.9555  │  0.9651  │  0.7628  │ 0.7725  │  2.66%   │  2.71%   │
├─────────┼──────────┼─────────┼──────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│   t+3   │  0.5805  │ 0.5748  │  1.0195  │  1.0264  │  0.8130  │ 0.8144  │  2.84%   │  2.85%   │
├─────────┼──────────┼─────────┼──────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│   t+4   │  0.5689  │ 0.5658  │  1.0336  │  1.0374  │  0.8225  │ 0.8231  │  2.87%   │  2.88%   │
├─────────┼──────────┼─────────┼──────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│   t+5   │  0.5711  │ 0.5648  │  1.0325  │  1.0401  │  0.8246  │ 0.8299  │  2.88%   │  2.90%   │
├─────────┼──────────┼─────────┼──────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ Average │  0.6194  │ 0.6136  │  0.9665  │  0.9740  │  0.7699  │ 0.7745  │  2.69%   │  2.71%   │
└─────────┴──────────┴─────────┴──────────┴──────────┴──────────┴─────────┴──────────┴──────────┘
  • The ensemble model consistently outperforms the alternative: It achieves a higher R² score and lower errors (RMSE, MAE, MAPE) than the RidgeCV model across every single forecast horizon. The margin is narrow, but consistent.

  • No signs of overfitting: Both models perform significantly better on the test set than their average cross-validation scores.

    • Ensemble R²: 0.6194 (Test) vs. 0.4416 (Weighted CV)
    • RidgeCV R²: 0.6136 (Test) vs. 0.4219 (OOB CV)
    • TimeSeriesSplit provided a conservative, robust estimate of performance. The final models, trained on 100% of the training data, are more powerful and perform better on the immediately subsequent test period.
  • A deeper look at the metrics:

    • The ensemble's average R² of 0.6194 indicates it successfully explains nearly 62% of the temperature variance on unseen data, a decent result for a noisy, real-world time series given the amount of data available.
    • The average RMSE of 0.9665°C means our ensemble's forecasts are, on average, off by less than one degree Celsius, providing a tangible measure of its accuracy.
    • The MAPE of 2.69% confirms that the magnitude of this error is small relative to the actual temperature values.
  • Practical trade-offs:

    • While the horizon-weighted ensemble is the clear winner in raw performance, the RidgeCV model's performance is remarkably close.
    • The ensemble achieved its slim victory through a complex, multi-stage tuning and ensembling process. The RidgeCV model, by contrast, is far simpler, trains in (milli)seconds, and is more directly interpretable.
    • This validates RidgeCV as an exceptionally simple but powerful model. In a production scenario where speed and simplicity are prioritized over the last fraction of a percentage point in accuracy, RidgeCV would be an outstanding choice. For a scenario demanding the absolute highest accuracy, the ensemble would be better.
In [45]:
model_name = "WeightedAvg_Ridge_TunedLGBM"
horizon_to_plot = "t+1"

y_true_plot = y_test[f"target_temp_{horizon_to_plot}"]
y_pred_plot = final_predictions[model_name][horizon_to_plot]

fig_ts_dashboard = diag.plot_performance_timeseries_dashboard(
    y_true=y_true_plot,
    y_pred=y_pred_plot,
    model_name="Champion Ensemble",
    horizon=horizon_to_plot,
    # add_rolling_mean=7,
)
fig_ts_dashboard.show()
  • Good tracking performance: Top panel shows model's predictions (red dotted line) track the actual temperature (blue line) with a decent accuracy, successfully capturing both the large-scale seasonal trends and the majority of the day-to-day fluctuations.

  • Slight conservative bias: Tendency to be slightly conservative, under-predicting the absolute peaks of heatwaves and over-predicting the sharpest cold snaps - a common characteristic of robust, regularized models that are trained to prioritize signal over noise. While we did try to improve extreme cases like these (Q.5.7), it proved to be unfruitful.

  • Random, well-behaved errors: The residual plot (bottom panel) confirms that the model's errors are, for the most part, randomly distributed around the zero line, with no obvious patterns or seasonal biases. The vast majority of errors fall within a tight +/- 2.5°C band, which is a decent result.

In [46]:
fig_stat_dashboard = diag.plot_performance_statistical_dashboard(
    y_true=y_true_plot,
    y_pred=y_pred_plot,
    model_name="Champion Ensemble",
    horizon=horizon_to_plot,
)

fig_stat_dashboard.show()
  • Good calibration: The "Actual vs. Predicted" scatter plot (left) shows a tight, linear cloud of points closely following the ideal y=x line, indicating the model is well-calibrated across the entire temperature range.

  • Unbiased and (pretty) normally distributed errors:

    • The error distribution peaks almost exactly at 0°C, confirming the model is fundamentally unbiased.
    • The shape of the errors (red KDE line) closely mirrors a theoretical normal distribution (dotted line), meaning the model captured most of the predictive signal while leaving out the random noise.

$\implies$ Model conclusion: Accurate (high R², low RMSE) and statistically robust - predictions are well-calibrated, errors are unbiased, random, and normally distributed.

Q6. UI to demo our app


This section addresses a critical component of delivering a machine learning product: creating an intuitive and informative user interface (UI) for demonstration and interaction. The final step is to transition our predictive models from a research notebook into an operational tool that visualizes our findings and showcases the model's capabilities in a user-friendly web application.

6.1 UI Development Tools and Process


To build the UI, we chose a standard, robust, and universally compatible stack of web technologies. This ensures the application is lightweight, fast, and can be easily hosted on any modern static web platform.

  • Structure (HTML): We used semantic HTML5 to build the skeleton of the application. The project is structured as a multi-page site, with separate, well-organized files for the main forecast app (index.html), the detailed daily view (detail.html), and the technical project report (about.html).

  • Styling (CSS): Cascading Style Sheets (CSS) were used to implement the visual design. We followed modern CSS practices, including Flexbox and Grid for layout, to create a fully responsive interface that adapts seamlessly from desktop to mobile screens. The design language is clean and minimalist, inspired by modern weather applications, using a "frosted glass" effect to ensure readability while maintaining an immersive feel.

  • Interactivity (JavaScript): Vanilla JavaScript was used to power all the dynamic features of the application. The scripts handle tasks such as:

    • Asynchronously fetching prediction data from local JSON files using the fetch() API.
    • Dynamically creating and updating HTML elements to display data.
    • Managing user interactions, such as handling clicks on the date picker and navigation buttons.
    • Implementing client-side logic to update the UI based on the selected date and forecast horizon.

6.2 Data Integration for the UI


A crucial aspect of this UI is that it is a true demonstration of our model's performance on unseen data. It is not using random or placeholder numbers.

  • Data Source: The entire application is powered by two JSON files (rich_data.json and hourly_predictions.json) that were programmatically exported from our Jupyter Notebook at the end of the project.

  • Test Set Predictions Only: The data contained in these files represents the predictions made by our champion models on the held-out test set only. This ensures that what the user sees on the website is a genuine and honest representation of our model's real-world performance.

  • Contextual Data: To provide a rich user experience (e.g., for icons and detailed metrics like humidity), the model's temperature predictions were merged with the actual historical data from the corresponding future date in the test set. This creates a powerful visualization, allowing a direct comparison between our model's temperature forecast and the weather conditions that actually occurred. This process is methodologically sound and leak-free, as the historical context was only joined after the model had already made its blind predictions.
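The export step can be sketched as follows; the column names and values here are hypothetical, chosen only to show the shape of the hand-off to the front end's fetch() call:

```python
import json
import pandas as pd

# Hypothetical test-set predictions merged with the actual observed conditions
# (the real export uses the columns needed for icons, humidity, etc.)
df_ui = pd.DataFrame({
    "date": pd.date_range("2024-02-20", periods=3).strftime("%Y-%m-%d"),
    "pred_temp": [28.4, 29.1, 27.8],
    "actual_humidity": [70, 65, 72],
})

# Record-oriented JSON is the easiest shape to consume from vanilla JavaScript
ui_payload = json.dumps(df_ui.to_dict(orient="records"), indent=2)
print(ui_payload[:40])
```

On the front end, `fetch("rich_data.json").then(r => r.json())` then yields an array of per-day objects ready for rendering.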

6.3 Final UI Showcase


The final application consists of three main views: the Main Forecast Page, the Detailed Daily View, and the "About our Project" page.

Main Forecast Page (index.html)

This is the primary user interface, designed for quick and easy access to the weather forecast. It features a date picker, navigation arrows, and a 6-day forecast (the selected day + the next 5 days), all floating over a dynamic background that changes with the weather.

Detailed Daily View (detail.html)

When a user clicks on any day from the main page, they are taken to this detailed dashboard. It presents a comprehensive, two-panel breakdown of the forecast for that specific day. This includes key metrics (Humidity, Wind, Precipitation, etc.) with custom data visualizations and an interactive, 24-hour temperature graph that showcases the predicted hourly fluctuations.

About Our Project Page (about.html)

This page serves as a full technical report of the project, presented as an interactive, step-by-step journey controlled by navigation arrows. It contains all the key charts, tables, and conclusions from our analysis—from Data Sourcing through Modeling and Retraining strategies—allowing users to explore the complete methodology behind the model.

Click to see the future!

Q7. When Should the Model Be Retrained?


This section addresses a critical MLOps question: how do we design a robust, automated strategy for monitoring our models in a simulated production environment and deciding when a retrain is necessary? A static model's performance will inevitably degrade over time as the real-world climate it's trying to predict evolves.

7.1. Understanding Model Drift in Production


Models degrade after deployment because underlying physical, behavioral, and seasonal dynamics evolve. In weather forecasting this manifests as model drift and, if unaddressed, steadily increases error.

Drift Categories

1. Concept Drift: Feature → target relationships shift (e.g., the humidity–temperature coupling changes).

2. Data Drift (Covariate Shift): Input feature distributions shift (e.g., long-term humidity levels trend lower or higher).

3. Label Drift (Prior/Target Shift): The target distribution shifts (e.g., gradual warming pushes the mean temperature upward).

Impact

Drift causes:

  • Higher RMSE / MAE (forecast degradation)
  • Lower R² (lost explanatory power)
  • Misleading operational decisions (energy planning, agriculture, etc.)

$\implies$ We need a systematic detector + retraining policy rather than ad‑hoc manual intervention.

What We Do Next

We introduce an EnsembleDriftDetector with explicit thresholds on performance, feature distributions, and target stability; then simulate static vs adaptive strategies to answer: "When should the model be retrained?"

7.2. Automated Drift Detection System


We deploy an EnsembleDriftDetector to continuously assess concept, data, and label drift using stored baseline distributions and validation metrics.

Baselines Captured

  • Performance: baseline_rmse, baseline_r2
  • Feature distributions: each training column's raw values
  • Target distribution: training temperature values

Detection Methods & Thresholds

| Drift Type | Method | Threshold |
|------------|--------|-----------|
| Concept Drift | RMSE ratio + R² drop | RMSE > 1.30 × baseline OR R² drop > 0.20 |
| Data Drift | KS test per feature | p < 0.001 for > 10 features |
| Label Drift | KS test + mean shift | p < 0.001 AND \|shift\| > 2.5°C |

Note: These are initial thresholds; the EnsembleDriftDetector implements a more nuanced 3-tier priority system, as seen in utils/drift_detection.py
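A minimal sketch of the per-feature KS check (the project's EnsembleDriftDetector in utils/drift_detection.py is more elaborate), using scipy's two-sample Kolmogorov–Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

def count_drifted_features(X_base: np.ndarray, X_new: np.ndarray, alpha: float = 0.001) -> int:
    """Count features whose recent distribution differs from the training baseline."""
    return sum(
        ks_2samp(X_base[:, j], X_new[:, j]).pvalue < alpha
        for j in range(X_base.shape[1])
    )

rng = np.random.default_rng(0)
X_base = rng.normal(size=(500, 5))
X_new = X_base + np.array([0.0, 0.0, 0.0, 3.0, 3.0])  # shift two features strongly
print(count_drifted_features(X_base, X_new))  # → 2
```

With the thresholds above, data drift would be flagged once this count exceeds 10 features.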

Trigger Logic

Retrain if:

  • High Priority: Severe performance degradation is detected.
  • Medium Priority: Moderate performance degradation OR (Data drift AND Label drift) are both detected.
  • Low Priority: Scheduled time-based trigger (e.g., 90 days) is reached.

This structured approach balances sensitivity (catching early degradation) and specificity (avoiding unnecessary retrains).
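The tier logic can be condensed into a small function. This is a simplified sketch using the thresholds quoted in this section, not the exact implementation in utils/drift_detection.py:

```python
def retrain_priority(rmse_ratio: float, r2_drop: float,
                     data_drift: bool, label_drift: bool,
                     days_since_retrain: int) -> str:
    """Map monitoring signals to a retraining priority (simplified sketch)."""
    if rmse_ratio > 1.5 or r2_drop > 0.3:
        return "HIGH"    # severe performance degradation -> immediate retrain
    if rmse_ratio > 1.3 or r2_drop > 0.2 or (data_drift and label_drift):
        return "MEDIUM"  # moderate degradation or corroborated drift -> proactive retrain
    if days_since_retrain >= 90:
        return "LOW"     # scheduled freshness safety net
    return "NONE"

print(retrain_priority(1.6, 0.0, False, False, 10))   # → HIGH
print(retrain_priority(1.0, 0.0, True, True, 10))     # → MEDIUM
print(retrain_priority(1.0, 0.0, False, False, 120))  # → LOW
```

The ordering matters: a severe failure short-circuits to HIGH before any softer evidence is considered.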

7.3. Simulation Experiment Design


Objective: Quantify the benefit of adaptive retraining versus a single frozen deployment under drift conditions.

We compare two strategies over a 180‑day test window:

Strategy 1: Static Ensemble (Baseline)

  • Load pre-trained horizon models (h1–h5).
  • Generate multi-horizon forecasts daily.
  • Never retrain; track gradual performance decay.

Strategy 2: Adaptive Ensemble (Proposed)

  • Start from same initial models.
  • Every 30 days: run drift check on most recent slice.
  • If trigger condition met → retrain models on expanded training window.
  • Continue forecasting with updated set.

Evaluation Metrics

  • Per-horizon and averaged: RMSE, MAE, R².
  • Retrain events & timing.
  • Error trajectories (rolling windows).

$\implies$ This reveals whether monitoring + retraining overhead is justified by retained accuracy and stability.

In [47]:
# Load pre-trained ensemble models for each horizon
sim_ensemble = {f"h{i}": joblib.load(f"assets/models/daily/WeightedAvg_Ridge_TunedLGBM_h{i}.pkl") for i in range(1, 6)}

# Use the model-ready data that already has all engineered features
df_sim = df_model_ready.copy()

# Split into train/test using the same approach as the main modeling section
n_total = len(df_sim)
n_test = int(n_total * config.TEST_SIZE)
n_train = n_total - n_test

# Load the feature selection mask for linear models
selected_features_mask = joblib.load("./assets/saved_results/selected_features_mask.pkl")
linear_feature_cols = [col for i, col in enumerate(feature_cols) if selected_features_mask[i]]
tree_feature_cols = feature_cols  # Use all features for tree models

# Prepare full training data (will be expanded in the adaptive strategy).
# Both frames keep every column; feature and target columns are selected downstream.
X_train_full_sim = df_sim.iloc[:n_train]
y_train_full_sim = df_sim.iloc[:n_train]

# Prepare the 180-day test slice for the simulation
X_test_sim = df_sim.iloc[n_train : n_train + 180]
y_test_sim = df_sim.iloc[n_train : n_train + 180]

# Pre-slice feature sets for convenience
X_test_sim_linear = X_test_sim[linear_feature_cols]
X_test_sim_tree = X_test_sim[tree_feature_cols]

print(f"Initial training data size: {len(X_train_full_sim)} days")
print(f"Simulation test data size: {len(X_test_sim)} days")
Initial training data size: 3310 days
Simulation test data size: 180 days

7.4. Static Ensemble Predictions (Baseline)


We first establish a non‑adaptive baseline:

  • Load pre-trained weighted horizon ensemble models (WeightedAvg_Ridge_TunedLGBM_h1..h5).
  • Generate daily forecasts for each horizon over the 180‑day test slice.
  • Collect per-horizon errors (RMSE, MAE, R²) and aggregate averages.
  • No drift checks; no retraining; serves as performance floor.

This baseline will be contrasted against the adaptive retraining strategy to measure retained accuracy and responsiveness under evolving conditions.

In [48]:
static_results = dd.run_static_prediction(
    ensemble_models=sim_ensemble,
    X_test_linear=X_test_sim_linear,
    X_test_tree=X_test_sim_tree,
    y_test=y_test_sim,
)

# Store the baseline performance metrics
static_predictions = static_results["predictions"]
static_rmse = static_results["avg_metrics"]["rmse"]
static_r2 = static_results["avg_metrics"]["r2"]
Making predictions for all 5 horizons (Static Baseline)...

  Predicting horizon h1 (t+1 days ahead)...
  Predicting horizon h2 (t+2 days ahead)...
  Predicting horizon h3 (t+3 days ahead)...
  Predicting horizon h4 (t+4 days ahead)...
  Predicting horizon h5 (t+5 days ahead)...

============================================================
✓ Static Ensemble Performance by Horizon:
============================================================
Horizon      RMSE (°C)       MAE (°C)        R²        
------------------------------------------------------------
t+1          0.7625          0.6154          0.7152    
t+2          0.9418          0.7491          0.5652    
t+3          1.0482          0.8452          0.4608    
t+4          1.0811          0.8714          0.4264    
t+5          1.0642          0.8612          0.4469    
------------------------------------------------------------
Average      0.9796          0.7885          0.5229    
============================================================

7.5. Adaptive Prediction Simulation


Here we implement the intelligent retraining strategy:

  • Start with the same pre-trained ensemble.
  • Check for drift every 30 days using our 3-tier priority EnsembleDriftDetector.
  • Retrain all 5 horizon models when a drift trigger condition is met.
  • Continue predictions with the newly updated model.

Key difference: The model adapts to changing patterns by incorporating recent observations, aiming to stay calibrated to current conditions and mitigate performance decay.

In [49]:
# --- 1. Initialize the Drift Detector ---
detector = dd.EnsembleDriftDetector(baseline_rmse=static_rmse, baseline_r2=static_r2)
# Set the training data distributions as the reference
detector.set_baseline_distributions(
    X_train_full_sim[tree_feature_cols],
    y_train_full_sim["target_temp_t+1"],
)

# --- 2. Run the Adaptive Simulation ---
# adaptive_results = dd.run_adaptive_simulation(
#     initial_models=sim_ensemble,
#     drift_detector=detector,
#     X_train_full=X_train_full_sim,
#     y_train_full=y_train_full_sim,
#     X_test_full=X_test_sim,
#     y_test_full=y_test_sim,
#     linear_feature_cols=linear_feature_cols,
#     tree_feature_cols=tree_feature_cols,
# )
adaptive_results = joblib.load("./assets/saved_results/q7_adaptive_results.pkl")

# --- 3. Display and Interpret Results ---
adaptive_rmse = adaptive_results["avg_metrics"]["rmse"]
adaptive_r2 = adaptive_results["avg_metrics"]["r2"]
retrain_dates = adaptive_results["retrain_dates"]
retrain_priorities = adaptive_results["retrain_priorities"]
✓ Drift Detector initialized (3-TIER PRIORITY SYSTEM)
  Baseline RMSE: 0.9796°C
  Baseline R²: 0.5229

  Priority Thresholds:
  🔴 HIGH (Immediate):  RMSE>1.5× OR R²drop>0.3
  🟡 MEDIUM (Proactive): Data+Label drift OR moderate performance drop
  🟢 LOW (Scheduled):    90 days without retrain
✓ Stored 139 feature distributions

7.6. Visual Analysis & Interpretation


Visual analysis of the 3-tier priority-based adaptive retraining strategy compared to static deployment:

  1. Overall metrics comparison - Static vs. Adaptive performance across RMSE, MAE, and R².
  2. Performance by horizon - How adaptive retraining improves accuracy for each forecast horizon (t+1 through t+5).
  3. Improvement percentages - Quantifying the value of adaptive retraining by horizon and metric.
  4. Error timeline with priority markers - When did retraining events occur, at what priority level (🔴 HIGH / 🟡 MEDIUM / 🟢 LOW), and how did they impact prediction errors?
  5. Rolling performance trends - 30-day rolling RMSE showing performance evolution and the impact of priority-based retraining events.

These visualizations reveal the practical impact of implementing a hierarchical drift detection system for production deployment.
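The 30-day rolling RMSE used in the trend panels reduces to a one-liner on the residuals; a sketch on synthetic data:

```python
import numpy as np
import pandas as pd

def rolling_rmse(y_true: pd.Series, y_pred: pd.Series, window: int = 30) -> pd.Series:
    """Rolling RMSE: rolling mean of squared residuals, then square root."""
    return ((y_true - y_pred) ** 2).rolling(window).mean() ** 0.5

idx = pd.date_range("2024-01-01", periods=60, freq="D")
y_true = pd.Series(np.zeros(60), index=idx)
y_pred = pd.Series(np.ones(60), index=idx)  # a constant 1°C error
print(rolling_rmse(y_true, y_pred).iloc[-1])  # → 1.0
```

The first `window - 1` entries are NaN by construction, which is why the rolling curves start partway into the test window.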

In [50]:
dd.plot_simulation_results(
    static_results=static_results,
    adaptive_results=adaptive_results,
    y_test=y_test_sim,
)

7.7. Final Answer: The Retraining Policy


Question: When should the multi‑horizon temperature forecasting ensemble be retrained to sustain production performance?

$\implies$ Implement a 3-tier priority system, checking for drift every 30 days. Retrain immediately on HIGH priority events (severe performance drops), proactively on MEDIUM priority events (combined data/label drift), and automatically on LOW priority events (scheduled maintenance).

This policy moves beyond a simple, single-condition trigger to a more intelligent, risk-based approach.

Key Evidence from Simulation (180-Day Test)

Our simulation provides strong, quantitative evidence that this adaptive policy is superior to a static, "deploy-and-forget" model:

  • Overall Performance Gain: The adaptive strategy reduced the average RMSE by 1.56% and increased the average R² by 2.81%.
  • Consistent Improvement: The adaptive model outperformed the static model at every single forecast horizon (t+1 through t+5).
  • Error Suppression: Visual analysis of the 30-day rolling RMSE shows the adaptive model consistently suppressing error decay. After each retraining event, the model's performance recovers or stabilizes, while the static model's performance continues to degrade.

The Recommended Retraining Policy in Detail

This policy is implemented by the EnsembleDriftDetector class.

| Priority | Trigger Condition | Thresholds (from our simulation) | Action |
|----------|-------------------|----------------------------------|--------|
| 🔴 HIGH | Severe Performance Degradation: a sudden, unacceptable drop in accuracy. | RMSE Ratio > 1.50 OR R² Drop > 0.30 | Immediate Retrain |
| 🟡 MEDIUM | Moderate Performance Drop OR Combined Distributional Drift. | RMSE Ratio > 1.30 OR R² Drop > 0.20 OR (Data Drift AND Label Drift) | Proactive Retrain |
| 🟢 LOW | Scheduled Maintenance: the model has not been retrained for an extended period. | Days since last retrain ≥ 90 | Scheduled Retrain |

Data Drift: >10 features with KS test p < 0.001. Label Drift: Target KS test p < 0.001 AND |mean shift| > 2.5°C.

Why This Policy is Effective

  1. Balances Reactivity and Stability: It reacts instantly to critical failures (HIGH priority) but requires corroborating evidence for more subtle changes (MEDIUM priority), preventing over-reaction to transient noise.
  2. Resource Efficiency: It limits retraining to events where a genuine regime shift is detected, avoiding the computational cost and operational churn of a fixed, frequent retraining schedule (e.g., daily).
  3. Guarantees Freshness: The time-based trigger (LOW priority) acts as a safety net, ensuring the model never becomes excessively stale, even in the absence of dramatic drift.
  4. Explainability & Auditability: Each retraining decision is tied to a specific priority level and a clear reason (e.g., "HIGH Priority - CRITICAL performance degradation"), providing a clear audit trail.

Practical Recommendations

  • Log Everything: Log the outputs of every drift check (metrics, priority level, and final decision) for governance and future analysis.
  • Update Baselines: After retraining on new data, the performance and distribution baselines must be recalculated and stored for the detector to use going forward.
  • Review Thresholds Periodically: The thresholds themselves are hyperparameters. They should be reviewed quarterly or semi-annually to ensure they are not causing too many false positives (unnecessary retrains) or false negatives (missed drift).

Final Conclusion: Retrain intelligently, not indiscriminately. Use a multi-tiered drift detection system to trigger retraining only when there is data-driven evidence of a meaningful and persistent shift in performance or underlying data distributions.

Q8. Hourly Weather Data Forecasting Analysis


This section addresses a key question: Can leveraging high-frequency (hourly) data improve our final daily temperature forecast? We explore two distinct, competing philosophies to answer this:

  1. Strategy 1: Aggregate-then-Predict: We first distill the rich hourly data into a powerful set of aggregated daily features, then train our daily forecasting models on this enriched dataset.
  2. Strategy 2: Predict-then-Aggregate: We build models to forecast the temperature at an hourly resolution, then aggregate these high-frequency predictions into daily averages.

By comparing both strategies to our original daily-data champion, we can make a definitive, data-driven conclusion on the practical value of higher-frequency data for this problem.

8.1. Strategy 1: Aggregate-then-Predict


The philosophy here is to create a superior daily dataset. We use the hourly data to engineer a wide array of features—capturing not just daily averages, but also intra-day volatility, timing of extremes, and other dynamics—that are impossible to derive from daily data alone.

8.1.1. Load & Clean Hourly Data


In [51]:
df_hourly_clean = hp.load_and_clean_hourly_data(config.HOURLY_DATA_PATH)
Raw hourly data loaded: (94248, 25)
Dropped 28 duplicate rows.
✓ Data cleaning complete. No remaining NaNs.

8.1.2. Aggregate & Engineer Daily Features


Next, we execute the core of this strategy: resampling the cleaned hourly data to a daily frequency and then applying the full feature engineering pipeline (temporal, cyclical, lags, rolling windows).

In [52]:
df_s1_features = hp.create_daily_enriched_features(df_hourly_clean)
Daily aggregation complete. Shape: (3927, 51)
✓ Feature engineering complete. Final shape: (3927, 154)

8.1.3. Model Training & Evaluation


In [53]:
# This function handles data splitting, LASSO, model training, and evaluation.
# results_s1_df = hp.run_strategy_1_evaluation(
#     df_features=df_s1_features,
#     test_size=config.TEST_SIZE
# )
results_s1_df = joblib.load("./assets/saved_results/results_s1_df.pkl")

table = Table(title="[bold]Strategy 1 Results: Aggregate-then-Predict[/bold]")
table.add_column("Model", style="cyan", no_wrap=True)
table.add_column("Avg R² ↑", style="green", justify="right")
table.add_column("Avg RMSE ↓", style="yellow", justify="right")
table.add_column("Avg MAE ↓", style="yellow", justify="right")

for _, row in results_s1_df.iterrows():
    table.add_row(
        row["Model"],
        f"{row['Avg_R2']:.4f}",
        f"{row['Avg_RMSE']:.4f}°C",
        f"{row['Avg_MAE']:.4f}°C"
    )
console.print(table)
      Strategy 1 Results: Aggregate-then-Predict      
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Model          ┃ Avg R² ↑ ┃ Avg RMSE ↓ ┃ Avg MAE ↓ ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Champion_Daily │   0.6195 │   0.9664°C │  0.7698°C │
│ Q8.1_Ensemble  │   0.6117 │   0.9762°C │  0.7730°C │
│ Q8.1_RidgeCV   │   0.5969 │   0.9944°C │  0.7870°C │
└────────────────┴──────────┴────────────┴───────────┘

8.2. Strategy 2: Predict-then-Aggregate


This strategy takes a fundamentally different approach. Instead of summarizing the data first, we build models that forecast at the highest available resolution—hourly.

The process involves:

  1. Extensive Hourly Feature Engineering: Creating a rich feature set where each row represents a single hour, with features describing its temporal context and recent history (e.g., temperature 24 hours ago).
  2. Multi-Model Hourly Forecasting: Training 5 specialist multi-output models. Each model is responsible for predicting a 24-hour block of the future (e.g., Model 1 predicts hours 1-24, Model 2 predicts hours 25-48).
  3. Aggregation of Predictions: The resulting 120 hourly predictions are then aggregated into 5 daily averages to be compared with our daily target.
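Step 3 — collapsing the 120 hourly predictions into 5 daily averages — is a simple reshape-and-mean; a sketch with synthetic values:

```python
import numpy as np

def hourly_to_daily(preds_120h: np.ndarray) -> np.ndarray:
    """Average each consecutive 24-hour block of a 120-hour forecast into a daily mean."""
    return preds_120h.reshape(5, 24).mean(axis=1)

# Synthetic forecast: a repeating diurnal ramp plus a per-day offset
hourly = np.tile(np.arange(24.0), 5) + np.repeat(np.arange(5.0), 24)
print(hourly_to_daily(hourly))  # → [11.5 12.5 13.5 14.5 15.5]
```

This makes the two strategies directly comparable: both ultimately produce 5 daily temperature values per forecast origin.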

8.2.1. Hourly Feature Engineering & Data Preparation


We start again from the cleaned hourly data, but this time, we engineer features at the hourly level, creating an extensive set of lags and rolling windows that capture both recent (e.g., 3-hour average) and diurnal (e.g., 24-hour lag) patterns.

In [54]:
# This function handles the hourly feature engineering, 120-hour target creation, and data splitting.
s2_data = hp.prepare_strategy_2_data(
    df_hourly_clean=df_hourly_clean,
    test_start_timestamp=pd.to_datetime("2024-02-20"),
)
Hourly feature engineering complete. Shape: (94220, 95)
Data prepared for Strategy 2. Train shape: (79993, 215), Test shape: (14035, 215)

8.2.2. Multi-Model Training & Evaluation


Using the prepared hourly data, we now train our 5 specialist models. Each model uses a lightly-tuned LGBMRegressor wrapped in a MultiOutputRegressor to predict its 24-hour forecast window. The final hourly predictions are then aggregated and evaluated.

In [55]:
# This function runs the entire S2 pipeline: trains 5 models, predicts, aggregates, and evaluates.
# s2_results = hp.run_strategy_2_evaluation(
#     X_train=s2_data["X_train"],
#     y_train=s2_data["y_train"],
#     X_test=s2_data["X_test"],
#     y_test=s2_data["y_test"],
#     day_targets=s2_data["day_targets"],
# )

# results_s2_df = pd.DataFrame([s2_results])
results_s2_df = joblib.load("./assets/saved_results/results_s2_df.pkl")

final_hourly_results_df = pd.concat([results_s1_df, results_s2_df], ignore_index=True)
final_hourly_results_df = final_hourly_results_df.sort_values(by="Avg_R2", ascending=False)

table = Table(title="[bold]Strategy 2 Results: Predict-then-Aggregate[/bold]")
table.add_column("Model", style="cyan", no_wrap=True)
table.add_column("Avg R² ↑", style="green", justify="right")
table.add_column("Avg RMSE ↓", style="yellow", justify="right")
table.add_column("Avg MAE ↓", style="yellow", justify="right")

for _, row in final_hourly_results_df.iterrows():
    table.add_row(
        row["Model"],
        f"{row['Avg_R2']:.4f}",
        f"{row['Avg_RMSE']:.4f}°C",
        f"{row['Avg_MAE']:.4f}°C"
    )
console.print(table)
          Strategy 2 Results: Predict-then-Aggregate          
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Model                  ┃ Avg R² ↑ ┃ Avg RMSE ↓ ┃ Avg MAE ↓ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Champion_Daily         │   0.6195 │   0.9664°C │  0.7698°C │
│ Q8.2_Hourly_MultiModel │   0.6185 │   0.9436°C │  0.7569°C │
│ Q8.1_Ensemble          │   0.6117 │   0.9762°C │  0.7730°C │
│ Q8.1_RidgeCV           │   0.5969 │   0.9944°C │  0.7870°C │
└────────────────────────┴──────────┴────────────┴───────────┘
  • Champion Prevails (on R²): The original Champion_Daily model, with its extensive per-horizon tuning, narrowly holds the top spot for R² (0.6195), indicating it remains the best at explaining the overall variance in temperature.

  • Strategy 2 Excels (on Error): The Q8.2_Hourly_MultiModel (Predict-then-Aggregate) is a very close second in R² (0.6185) but is the decisive winner on error metrics, achieving the lowest Avg RMSE (0.9436°C) and Avg MAE (0.7569°C). This means its forecasts are, on average, more accurate.

  • Strategy 1 is a Viable Alternative: The Q8.1_Ensemble (Aggregate-then-Predict) is the third-best performer. While not top-tier, it remains highly competitive without any HPO, confirming it's a robust and simpler approach.

R² vs. RMSE Trade-off:

  1. Why did the Champion_Daily win on R²?

    • Cleaner Signal: The daily data averages out intra-day noise, providing a more stable signal.
    • The Power of HPO: The more extensive, per-horizon hyperparameter optimization allowed the model to extract the maximum possible signal from the clean daily data, explaining the most variance.
  2. Why did the Q8.2_Hourly_MultiModel win on RMSE/MAE?

    • Fine-Grained Patterns: By modeling the full hourly sequence, this strategy captured subtle intra-day dynamics (e.g., the exact timing and shape of the daily temperature curve) that were lost during the pre-aggregation of Strategy 1.
    • Better Error Averaging: While predicting any single hour is noisy, the errors across 24 hours tend to cancel each other out. This leads to a more accurate final daily average, even if the model is less confident about the overall trend (lower R²).
  3. The Overarching Theme: Tuning is Untapped Potential

    • Both hourly strategies were tested with untuned or lightly-tuned models. The fact that the Q8.2_Hourly_MultiModel came so close to the fully-optimized champion—and beat it on error metrics—is a testament to its raw potential.
    • It is highly probable that applying the same rigorous HPO campaign to the 5 specialist hourly models would allow this strategy to surpass the Champion_Daily on all metrics.
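The error-averaging argument in point 2 can be demonstrated numerically: if hourly errors are roughly independent and zero-mean, the error of the daily average shrinks by about √24. A sketch on simulated errors:

```python
import numpy as np

rng = np.random.default_rng(42)
hourly_err = rng.normal(0.0, 1.5, size=(1000, 24))  # zero-mean hourly errors, sigma = 1.5°C
daily_err = hourly_err.mean(axis=1)                 # the error of the daily average

hourly_rmse = np.sqrt((hourly_err ** 2).mean())
daily_rmse = np.sqrt((daily_err ** 2).mean())
print(f"{hourly_rmse:.2f} vs {daily_rmse:.2f}")  # daily RMSE ≈ 1.5 / sqrt(24) ≈ 0.31
```

Real hourly errors are correlated, so the reduction is smaller in practice, but the direction of the effect is the same.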

$\implies$ For predicting daily average temperature:

  • Highest R² (Best for explaining variance): The original Champion_Daily model, leveraging a clean signal and extensive HPO, remains superior.
  • Lowest Error (Most Accurate Forecasts): The Q8.2_Hourly_MultiModel (Predict-then-Aggregate) is the best strategy, as it produces predictions with the lowest average error in degrees Celsius.

Recommendation: For a production system where minimizing the absolute forecast error (RMSE) is the primary goal, the Predict-then-Aggregate strategy is the superior choice and should be the focus of future optimization efforts (i.e., HPO). If interpretability and explaining variance are more important, the daily data approach remains the champion.

8.2.3. Per-Hour Predictions Evaluations


While the aggregated daily performance is our primary metric for comparison, analyzing the model's performance at its native hourly resolution provides critical diagnostic insights. This deep dive answers three key questions:

  1. Overall Accuracy: What is the typical error for a single hourly forecast?
  2. Performance Decay: How quickly does our forecast accuracy degrade as we predict further into the future?
  3. Systematic Biases: Is the model systematically better or worse at predicting certain times of the day (e.g., afternoon heat vs. overnight lows)?
In [56]:
hourly_analysis_results = hp.run_hourly_deep_dive_analysis(
    X_test=s2_data["X_test"],
    y_test=s2_data["y_test"],
    day_targets=s2_data["day_targets"],
    model_path_template="./assets/models/hourly/hourly_model_day_{day_num}.pkl",
)
--- Overall Average Hourly Performance ---
  - Avg. Hourly RMSE: 1.5465°C
  - Avg. Hourly R²:   0.7282
  • Performance Decay: The model's error increases sharply for the first ~12 hours, then stabilizes. This defines the reliable window for high-precision hourly forecasts. Beyond 12-24 hours, the model predicts a stable, climatologically average day rather than specific hourly fluctuations.
  • Intra-day Bias: There is a clear, systematic bias. The model is most accurate during overnight and early morning hours (lowest RMSE at 1-2 AM) and least accurate during late afternoon/evening hours (peak RMSE at 9 PM). This indicates a struggle to accurately predict the daily temperature peak and subsequent cooling.
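The decay profile behind these observations comes from grouping residuals by forecast lead hour; a minimal sketch of the computation on synthetic errors (the growth factors are hypothetical):

```python
import numpy as np
import pandas as pd

def rmse_by_lead_hour(errors: np.ndarray) -> pd.Series:
    """Column-wise RMSE; `errors` has shape (n_samples, n_lead_hours)."""
    return pd.Series(
        np.sqrt((errors ** 2).mean(axis=0)),
        index=[f"h+{h}" for h in range(1, errors.shape[1] + 1)],
    )

rng = np.random.default_rng(0)
# Synthetic errors whose spread grows with lead time, mimicking the observed decay
errors = rng.normal(size=(200, 5)) * np.linspace(0.5, 1.5, 5)
profile = rmse_by_lead_hour(errors)
print(profile.round(2))
```

The same grouping by hour-of-day (instead of lead hour) yields the intra-day bias curve.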

9. Improving Deployment Efficiency with ONNX


9.1. Model Deployment And Its Challenges From Research To Production


After successfully developing, training, and evaluating our temperature forecasting models, the final step is to transition them from experimental research into a real-world, production environment. This process, known as model deployment, turns a theoretical model into an operational tool that delivers practical value.

However, at this stage, model performance is assessed not only through accuracy metrics such as $R^2$, but also through factors such as inference speed, operational stability, and ease of integration of models into existing systems. Therefore, deploying a model from a research environment into a production application often encounters some challenges:

  • Dependency, Integration, and Platform Complexity: Models are usually trained using specific frameworks and libraries, such as PyTorch, Scikit-learn, or LightGBM in Python. Deploying them requires replicating these dependencies in the production environment, which can be heavy, complex, and incompatible with existing technology stacks (e.g., Java, C#, or C++). Additionally, models built in one language or framework cannot be directly integrated into systems using a different language or platform, creating deployment challenges across heterogeneous environments.
  • Hardware and Platform Diversity: Production environments are rarely homogeneous. A model may need to operate across a range of systems, from high-performance cloud servers with GPUs to resource-limited edge devices or different operating systems. Models saved in their native framework format are often not optimized for such variability.
  • Performance Limitations and Inference Requirements: Frameworks are typically designed to prioritize flexibility during training rather than maximizing inference efficiency. For real-time forecasting, low latency and high throughput are essential, and standard model formats may not achieve the necessary performance.

$\implies$ To optimize machine learning models for efficient deployment, ONNX (Open Neural Network Exchange) serves as the solution by packaging models into an open, standardized format that can run flexibly across multiple platforms while maintaining high performance and stability.

9.2. ONNX Study

9.2.1. What is ONNX?

  • ONNX (Open Neural Network Exchange) is an open-source format designed to represent machine learning models in a standardized way. It allows models developed and trained in one framework (such as scikit-learn, LightGBM, or PyTorch) to be easily converted and deployed across a variety of platforms and environments. By serving as a bridge between different AI tools, ONNX supports both traditional machine learning and deep learning models.

  • In order to facilitate cross-platform model deployment, the ONNX framework is made up of three primary components:

    • ONNX Format: A protobuf-based file format (.onnx) that stores model architecture, weights, and metadata in a standardized structure.
    • ONNX Runtime: A high-performance engine that executes .onnx model files efficiently on various hardware and operating systems. This eliminates dependency on the original training framework and ensures flexible, scalable model deployment.
    • Conversion Tools: Libraries such as skl2onnx and onnxmltools that convert framework-specific models to the ONNX format, enabling cross-platform compatibility.

9.2.2. Why ONNX Improves Deployment Efficiency?

| Challenge | ONNX Advantage |
| --- | --- |
| Interoperability | Convert both RidgeCV (Scikit-learn) and LGBMRegressor (LightGBM) into .onnx format → deployment only requires ONNX Runtime, not multiple library dependencies. |
| Performance | ONNX Runtime, written in C++, supports hardware acceleration (e.g., AVX, GPU), enabling significantly faster inference compared to using .predict() in Python. |
| Cross-Platform Deployment | Although models are trained in Python, they can be deployed in C#, C++, Java, or JavaScript environments, and pushed to edge devices, mobile, and IoT systems with hardware constraints, or to cloud platforms (Azure ML, AWS, GCP), using the standardized ONNX format. |
| Production Performance & Scalability | Ideal for real-time inference where high speed and low latency are required; ONNX Runtime is optimized for production-grade workloads. |
| Team Collaboration & Framework Independence | Teams using different frameworks can share models through the ONNX standard. This avoids vendor lock-in and allows flexible switching between frameworks and tools. |
| Model Optimization | Techniques such as quantization reduce model size and improve inference speed, ideal for edge and resource-limited devices. |

9.3. ONNX Application to Ho Chi Minh City Temperature Forecasting Project

9.3.1. Loading Models and Test Data

In [57]:
ridge_model = joblib.load("./assets/models/daily/RidgeCV_h1.pkl")
data_input_dir = "./assets/saved_results/step_9_data"
file_X_test_linear = os.path.join(data_input_dir, "X_test_linear.joblib")
file_X_test_tree = os.path.join(data_input_dir, "X_test_tree.joblib")

X_test_linear = joblib.load(file_X_test_linear)
X_test_tree = joblib.load(file_X_test_tree)

print(f"Shape of X_test_linear: {X_test_linear.shape}")
print(f"Shape of X_test_tree: {X_test_tree.shape}")
Shape of X_test_linear: (584, 41)
Shape of X_test_tree: (584, 139)

9.3.2. Converting the Simple Alternative (RidgeCV)

As a standard Scikit-learn estimator, the RidgeCV model is straightforward to convert, making it an excellent baseline for understanding the ONNX workflow and its benefits.

The conversion process involves three main steps: defining the model's input format, running the converter, and saving the result to a file.

In [58]:
# --- 1. Define the input format for the model ---
initial_type_ridge = [("float_input", FloatTensorType([None, X_test_linear.shape[1]]))]

# --- 2. Convert the Scikit-learn model to the ONNX format ---
onnx_model_ridge = skl2onnx.convert_sklearn(
    ridge_model,
    initial_types=initial_type_ridge,
    target_opset=12,  # Stable compatibility
)

# --- 3. Save the ONNX model to a file ---
onnx_filename_ridge = "./assets/models/onnx/ridge_forecaster.onnx"
with open(onnx_filename_ridge, "wb") as f:
    f.write(onnx_model_ridge.SerializeToString())

After successfully converting the RidgeCV model to ONNX format, the next critical step is to validate the conversion's correctness and measure its performance benefits. This verification process ensures that the ONNX model produces identical predictions to the original Scikit-learn model, while the benchmark quantifies the actual speedup gained from the conversion.

The verification workflow consists of four distinct phases: setting up the testing environment, verifying prediction accuracy, benchmarking inference speed, and generating a comprehensive performance comparison.

In [59]:
benchmark_results = {}

# Prepare test data (ONNX Runtime requires float32)
X_test_linear_np = X_test_linear.astype(np.float32).values

# Load the ONNX Runtime session for the converted RidgeCV model.
sess_ridge = rt.InferenceSession(onnx_filename_ridge)
input_name_ridge = sess_ridge.get_inputs()[0].name
output_name_ridge = sess_ridge.get_outputs()[0].name
In [60]:
# Run prediction with the original Scikit-learn model.
pred_original_ridge = ridge_model.predict(X_test_linear_np)

# Run prediction with the ONNX model.
pred_onnx_ridge = sess_ridge.run([output_name_ridge], {input_name_ridge: X_test_linear_np})[0]

# Compare predictions with numerical tolerance
np.testing.assert_allclose(pred_original_ridge, pred_onnx_ridge.flatten(), rtol=1e-5, atol=1e-5)
In [61]:
# Use %timeit with the '-o' flag to capture the timing result as an object.
print("Timing original Scikit-learn model (.pkl):")
t_original_ridge = %timeit -n 100 -r 7 -o ridge_model.predict(X_test_linear_np)
benchmark_results["RidgeCV (.pkl)"] = t_original_ridge

print("\nTiming ONNX Runtime model (.onnx):")
t_onnx_ridge = %timeit -n 100 -r 7 -o sess_ridge.run([output_name_ridge], {input_name_ridge: X_test_linear_np})
benchmark_results["RidgeCV (ONNX)"] = t_onnx_ridge


processed_data = []
for name, result in benchmark_results.items():
    model_name, model_type = name.split(" ")

    # Calculate throughput (predictions per second)
    throughput = len(X_test_linear_np) / result.average
    processed_data.append({
        "Model": model_name,
        "Deployment Type": model_type.strip("()"),
        # result.average is in seconds; convert to microseconds (µs) for readability.
        "Avg. Time (µs)": result.average * 1_000_000,
        "Std. Dev. (µs)": result.stdev * 1_000_000,
        "Throughput (predictions/sec)": throughput,
    })

df_results = pd.DataFrame(processed_data)

# Calculate the speedup factor relative to the .pkl baseline.
baseline_time = df_results.loc[0, "Avg. Time (µs)"]
onnx_time = df_results.loc[1, "Avg. Time (µs)"]
df_results.loc[1, "Speedup"] = baseline_time / onnx_time
df_results.loc[0, "Speedup"] = 1.0


result_table = (
    df_results.style.format({
        "Avg. Time (µs)": "{:,.1f}".format,
        "Std. Dev. (µs)": "± {:,.1f}".format,
        "Speedup": "{:,.2f}x".format,
        "Throughput (predictions/sec)": "{:,.0f}".format,
    })
    .set_properties(**{"text-align": "center"})
    .set_table_styles([dict(selector="th", props=[("text-align", "center")])])
)

print("--- Deployment Performance ---")
display(result_table)
Timing original Scikit-learn model (.pkl):
249 μs ± 40.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Timing ONNX Runtime model (.onnx):
57.6 μs ± 8.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
--- Deployment Performance ---
  Model Deployment Type Avg. Time (µs) Std. Dev. (µs) Throughput (predictions/sec) Speedup
0 RidgeCV .pkl 249.4 ± 40.3 2,342,028 1.00x
1 RidgeCV ONNX 57.6 ± 8.1 10,144,122 4.33x

Results Analysis

The performance comparison reveals significant efficiency gains when deploying the RidgeCV model in ONNX format compared to the traditional Scikit-learn .pkl format.

Key Findings
  • Inference Speed: The ONNX model demonstrates a 4.33x speedup over the original Scikit-learn model, reducing average prediction time from 249.4 µs to 57.6 µs. This acceleration matters for low-latency applications, such as real-time APIs, where every millisecond counts. At scale, this level of efficiency translates directly to lower infrastructure costs and a more responsive user experience.

  • Throughput Capacity: The ONNX deployment achieves approximately 10,144,122 pred/s, compared to 2,342,028 pred/s for the .pkl format. This roughly 4.3x increase in throughput enables the model to handle significantly higher request volumes without additional infrastructure.

  • Consistency: Both deployments exhibit stable, predictable inference times, with standard deviations of ± 40.3 µs for .pkl and ± 8.1 µs for ONNX. Notably, the ONNX model shows substantially lower variance (roughly an 80% reduction in standard deviation), suggesting more consistent performance characteristics and better reliability under varying system conditions.

$\implies$ Conclusion: The exploration of ONNX has proven highly successful. After converting the RidgeCV model to the ONNX format, we obtain a model that is about 4.3 times faster, requires far fewer dependencies, and can run on multiple platforms. Importantly, it still produces predictions that match the original model within the verified numerical tolerance. These findings confirm that ONNX is a valuable and effective solution for preparing machine learning models for real-world production use.

Important Notes

Benchmark Variability: The timing results presented here are specific to this particular execution and test environment. Performance metrics will vary across different runs due to factors such as:

  • System load and background processes
  • CPU thermal throttling and turbo boost states
  • Memory cache states and page faults
  • Operating system scheduling decisions

$\implies$ For production deployment decisions, it is recommended to run multiple benchmark sessions across different conditions and calculate confidence intervals to establish a reliable performance baseline.
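That recommendation can be followed with plain standard-library tools. The sketch below is a minimal, illustrative re-implementation of what `%timeit -n 100 -r 7 -o` reports (the `benchmark` helper and the toy workload are ours, not part of the project code): warm up first, time several independent repeats, and report mean ± standard deviation of the per-call time so that run-to-run variability is visible rather than hidden.

```python
import statistics
import time

def benchmark(fn, *, warmup: int = 10, repeats: int = 7, loops: int = 100):
    """Time fn over several repeats, mimicking %timeit.

    Returns (mean, stdev) of the per-call time in seconds.
    """
    for _ in range(warmup):  # warm caches and code paths before measuring
        fn()
    per_call = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            fn()
        per_call.append((time.perf_counter() - start) / loops)
    return statistics.mean(per_call), statistics.stdev(per_call)

# Toy workload standing in for sess.run(...) or model.predict(...).
mean_s, std_s = benchmark(lambda: sum(range(1000)))
print(f"{mean_s * 1e6:.1f} µs ± {std_s * 1e6:.1f} µs per call")
```

Running this helper in several sessions, under different system loads, gives the spread of means from which a confidence interval can be computed.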

9.3.3. Handling the Custom Ensemble Champion Model (WeightedAvg_Ridge_TunedLGBM)

Our champion model, WeightedAvg_Ridge_TunedLGBM, is not a single, standard machine-learning model but a custom Python class that acts as a full prediction pipeline. It wraps two base models, RidgeCV and TunedLGBM, each trained to make predictions from its own feature set. Once both base predictions are produced, the class applies a weighted-average formula to combine them into one final output.

This design gives the model strong predictive performance, but it also means that the overall structure is not a pure mathematical graph. The ensemble relies on procedural logic written in Python, including method calls, control flow, and custom weighting rules. Because of this, conversion tools such as skl2onnx cannot convert the entire ensemble into a single .onnx file. These tools are built to translate the mathematical computation graph of a model, not arbitrary Python code.

$\implies$ Therefore, a direct conversion of the entire ensemble object into a single .onnx file is not feasible.

Work-around if we need to convert our ensemble to ONNX:

  • Convert the base models (RidgeCV and tuned LightGBM)
  • When serving inference, serve and get both models' predictions
  • Put the predictions through a pipeline that applies the weighted sum (the per-horizon weight of each model) to get the final prediction, and return that result for the client

$\implies$ This increases complexity but preserves the ensemble logic, allowing the model to be served through ONNX should the advantages of ONNX outweigh the cost of the added complexity.
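The blending step of that work-around can be sketched as follows. The per-horizon weights below are made-up placeholders (the real weights live inside the WeightedAvg_Ridge_TunedLGBM class), and the two prediction arrays are toy values standing in for the outputs of the two ONNX Runtime sessions:

```python
import numpy as np

def blend_predictions(pred_ridge: np.ndarray,
                      pred_lgbm: np.ndarray,
                      weights_ridge: np.ndarray) -> np.ndarray:
    """Weighted average of two base-model forecasts.

    pred_*: shape (n_samples, n_horizons), one column per forecast day.
    weights_ridge: shape (n_horizons,), the Ridge model's weight for each
    horizon; the LightGBM model implicitly gets (1 - weight) per horizon.
    """
    w = np.asarray(weights_ridge)
    return pred_ridge * w + pred_lgbm * (1.0 - w)

# Toy numbers: 2 samples, 3 horizons, Ridge trusted more at short range.
pred_ridge = np.array([[30.0, 31.0, 32.0], [29.0, 30.0, 31.0]])
pred_lgbm  = np.array([[28.0, 29.0, 30.0], [27.0, 28.0, 29.0]])
weights    = np.array([0.7, 0.5, 0.3])  # hypothetical per-horizon weights

final = blend_predictions(pred_ridge, pred_lgbm, weights)
print(final[0])  # [29.4, 30.0, 30.6]
```

Because the blend is just element-wise arithmetic on the two sessions' outputs, it adds negligible latency on top of the ONNX inference calls it combines.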

In [62]:
"./ui/lasso_sunburst_chart.html"
Out[62]:
'./ui/lasso_sunburst_chart.html'